Thread: Linux: more cores = less concurrency.
Hi Guys,

I'm just doing some tests on a new server running one of our heavy select functions (the select part of a plpgsql function to allocate seats) concurrently. We do use connection pooling and split out some selects to slony slaves, but the tests here are primarily to test what an individual server is capable of.

The new server uses 4 x 8 core Xeon X7550 CPUs at 2GHz; our current servers are 2 x 4 core Xeon E5320 CPUs at 2GHz.

What I'm seeing is that when the number of clients is greater than the number of cores, the new server performs better on fewer cores. Has anyone else seen this behaviour? I'm guessing this is either a hardware limitation or something to do with Linux process management / scheduling? Any idea what to look into?

My benchmark utility is just a little .net/npgsql app that runs increasing numbers of clients concurrently; each client runs a specified number of iterations of any SQL I specify.

I've posted some results and the test program here: http://www.8kb.co.uk/server_benchmarks/
Glyn Astill <glynastill@yahoo.co.uk> wrote:

> The new server uses 4 x 8 core Xeon X7550 CPUs at 2Ghz

Which has hyperthreading.

> our current servers are 2 x 4 core Xeon E5320 CPUs at 2Ghz.

Which doesn't have hyperthreading.

PostgreSQL often performs worse with hyperthreading than without. Have you turned HT off on your new machine? If not, I would start there.

-Kevin
On Mon, 11 Apr 2011 13:09:15 -0500, "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> PostgreSQL often performs worse with hyperthreading than without.
> Have you turned HT off on your new machine? If not, I would start
> there.

And then make sure you aren't running CFQ.

JD

--
PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
--- On Mon, 11/4/11, Joshua D. Drake <jd@commandprompt.com> wrote:

>> The new server uses 4 x 8 core Xeon X7550 CPUs at 2Ghz
>
> Which has hyperthreading.

Yep, off. If you look at the benchmarks I took, HT absolutely killed it.

> PostgreSQL often performs worse with hyperthreading than without.
> Have you turned HT off on your new machine? If not, I would start
> there.
>
> And then make sure you aren't running CFQ.

Not running CFQ, running the no-op i/o scheduler.
On Mon, Apr 11, 2011 at 12:12 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> And then make sure you aren't running CFQ.

This++ Also if you're running a good hardware RAID controller, just go to NOOP
On Mon, Apr 11, 2011 at 12:23 PM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
> Yep, off. If you look at the benchmarks I took, HT absolutely killed it.
>
> Not running CFQ, running the no-op i/o scheduler.

Just FYI, in synthetic pgbench type benchmarks, a 48 core AMD Magny Cours with LSI HW RAID and 34 15k6 hard drives scales almost linearly up to 48 or so threads, getting into the 7000+ tps range. With SW RAID it gets into the 5500 tps range.
--- On Mon, 11/4/11, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> Just FYI, in synthetic pgbench type benchmarks, a 48 core AMD Magny
> Cours with LSI HW RAID and 34 15k6 hard drives scales almost linearly
> up to 48 or so threads, getting into the 7000+ tps range. With SW
> RAID it gets into the 5500 tps range.

I'll have to try the synthetic benchmarks next then, but something's definitely going off here. I'm seeing no disk activity at all, as they're selects and all pages are in RAM.

I was wondering if anyone had any deeper knowledge of any kernel tunables, or anything else for that matter.

A wild guess is something like multiple cores contending for cpu cache, cpu affinity, or some kind of contention in the kernel; alas, a little out of my depth.

It's pretty sickening to think I can't get anything else out of more than 8 cores.
On 04/11/2011 02:32 PM, Scott Marlowe wrote:
>> PostgreSQL often performs worse with hyperthreading than without.
>> Have you turned HT off on your new machine? If not, I would start
>> there.

Anyone know the reason for that?
--
Stephen Clark
NetWolves
Sr. Software Engineer III
Phone: 813-579-3200
Fax: 813-882-0209
Email: steve.clark@netwolves.com
http://www.netwolves.com
On 2011-04-11 21:42, Glyn Astill wrote:
> I'll have to try the synthetic benchmarks next then, but something's definitely going off here. I'm seeing no disk activity at all, as they're selects and all pages are in RAM.

Well, if you don't have enough computations to be bottlenecked on the cpu, then a 4 socket system is slower than a comparable 2 socket system, and a 1 socket system is even better. If you have a 1 socket system, all of your data can be fetched from "local" RAM as seen from your cpu; on a 2 socket system, 50% of your accesses will be "way slower", and on a 4 socket system even worse.

So more sockets only begin to pay off when you can actually keep the CPUs busy, or when the added memory keeps your database from going to disk due to size.

-- Jesper
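Jesper's point about local vs. remote RAM can be checked directly on the box. A minimal sketch, assuming the `numactl` utility is installed (it often isn't by default): the "node distances" matrix it prints shows the relative cost of remote DRAM access.

```shell
# Inspect NUMA topology; in the "node distances" matrix, 10 means local
# access and larger values mean proportionally more expensive remote
# fetches. Falls back gracefully when numactl isn't available.
numactl --hardware 2>/dev/null || echo "numactl not installed"
```

On a 4-socket X7550 machine this typically shows four nodes with noticeably asymmetric distances, which lines up with the "50% of accesses are way slower" argument above.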
--- On Mon, 11/4/11, david@lang.hm <david@lang.hm> wrote:

> the limit isn't 8 cores, it's that the hyperthreaded cores don't work
> well with the postgres access patterns.

This has nothing to do with hyperthreading. I have a hyperthreaded benchmark purely for completeness, but can we please forget about it.

The issue I'm seeing is that 8 real cores outperform 16 real cores, which outperform 32 real cores under high concurrency.

32 cores is much faster than 8 when I have relatively few clients, but as the number of clients is scaled up, 8 cores wins outright.

I was hoping someone had seen this sort of behaviour before, and could offer some sort of explanation or advice.
On Mon, 11 Apr 2011, Steve Clark wrote:
> Anyone know the reason for that?

Hyperthreads are not real cores. They make the assumption that you aren't fully using the core (because it is stalled waiting for memory or something like that) and context-switch you to a different set of registers, while using the same computational resources for your extra 'core'.

For some applications this works well, but for others it can be a very significant performance hit. (IIRC, this ranges from +60% to -30% or so in benchmarks.)

Intel has wonderful marketing and has managed to convince people that HT cores are real cores, but 16 real cores will outperform 8 real cores + 8 HT 'fake' cores every time. The 16 real cores will eat more power, be more expensive, etc., so you are paying for the performance.

In your case, try your new servers without hyperthreading. You will end up with a 4x4 core system, which should handily outperform the 2x4 core system you are replacing.

The limit isn't 8 cores, it's that the hyperthreaded cores don't work well with the postgres access patterns.

David Lang
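Whether SMT/hyperthreading is actually active can be read straight from the CPU topology. A hedged sketch relying on `lscpu` (present on most modern Linux distributions; the fallback covers systems where it isn't):

```shell
# Report whether hyperthreading/SMT is enabled by reading threads-per-core.
tpc=$(lscpu 2>/dev/null | awk -F: '/^Thread\(s\) per core/ {gsub(/ /, "", $2); print $2}')
if [ -z "$tpc" ]; then
    echo "could not determine SMT state"
elif [ "$tpc" -gt 1 ]; then
    echo "SMT/HT enabled ($tpc threads per core)"
else
    echo "SMT/HT off (1 thread per core)"
fi
```

This is a quicker cross-check than rebooting into the BIOS, and confirms what the OS scheduler actually sees.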
On Mon, Apr 11, 2011 at 1:42 PM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
> A wild guess is something like multiple cores contending for cpu cache, cpu affinity, or some kind of contention in the kernel, alas a little out of my depth.
>
> It's pretty sickening to think I can't get anything else out of more than 8 cores.

Have you tried running the memory stream benchmark Greg Smith had posted here a while back? It'll let you know if your memory is bottlenecking. Right now my 48 core machines are the king of that benchmark with something like 70+ gig a second.
Glyn Astill <glynastill@yahoo.co.uk> wrote:

> The issue I'm seeing is that 8 real cores outperform 16 real
> cores, which outperform 32 real cores under high concurrency.

With every benchmark I've done of PostgreSQL, the "knee" in the performance graph comes right around ((2 * cores) + effective_spindle_count). With the database fully cached (as I believe you mentioned), effective_spindle_count is zero. If you don't use a connection pool to limit active transactions to the number from that formula, performance drops off. The more CPUs you have, the sharper the drop after the knee.

I think it's nearly inevitable that PostgreSQL will eventually add some sort of admission policy or scheduler so that the user doesn't see this effect. With an admission policy, PostgreSQL would effectively throttle the startup of new transactions so that things remained almost flat after the knee. A well-designed scheduler might even be able to sneak marginal improvements past the current knee. As things currently stand, it is up to you to do this with a carefully designed connection pool.

> 32 cores is much faster than 8 when I have relatively few clients,
> but as the number of clients is scaled up 8 cores wins outright.

Right. If you were hitting disk heavily with random access, the sweet spot would increase by the number of spindles you were hitting.

> I was hoping someone had seen this sort of behaviour before, and
> could offer some sort of explanation or advice.

When you have multiple resources, adding active processes increases overall throughput until roughly the point when you can keep them all busy. Once you hit that point, adding more processes to contend for the resources just adds overhead and blocking. HT is so bad because it tends to cause context switch storms, but context switching becomes an issue even without it.

The other main issue is lock contention. Beyond a certain point, processes start to contend for lightweight locks, so you might context switch to a process only to find that it's still blocked and you have to switch again to try the next process, until you finally find one which can make progress. To acquire the lightweight lock you first need to acquire a spinlock, so as things get busier, processes start eating lots of CPU in the spinlock loops trying to get to the point of being able to check the LW locks to see if they're available.

You clearly got the best performance with all 32 cores and 16 to 32 processes active. I don't know why you were hitting the knee sooner than I've seen in my benchmarks, but the principle is the same. Use a connection pool which limits how many transactions are active, such that you don't exceed 32 processes busy at the same time, and make sure that it queues transaction requests beyond that so that a new transaction can be started promptly when you are at your limit and a transaction completes.

-Kevin
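Kevin's rule of thumb is easy to turn into a concrete pool limit. A sketch for the 32-core, fully cached case discussed in this thread (the numbers come from the thread, not a general recommendation):

```shell
# Pool-size "knee" heuristic: (2 * cores) + effective_spindle_count.
# A fully cached database gives an effective_spindle_count of 0.
cores=32
spindles=0
pool=$(( 2 * cores + spindles ))
echo "suggested active-connection limit: $pool"   # → suggested active-connection limit: 64
```

That limit would then be enforced in the pooler (e.g. a max-connections setting), with excess transaction requests queued rather than handed a backend.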
--- On Tue, 12/4/11, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> Have you tried running the memory stream benchmark Greg Smith had
> posted here a while back? It'll let you know if your memory is
> bottlenecking. Right now my 48 core machines are the king of that
> benchmark with something like 70+ gig a second.

No I haven't, but I will first thing tomorrow morning. I did run a sysbench memory write test though; if I recall correctly that gave me somewhere just over 3000 MB/s.
>>>>> "GA" == Glyn Astill <glynastill@yahoo.co.uk> writes:

GA> I was hoping someone had seen this sort of behaviour before,
GA> and could offer some sort of explanation or advice.

Jesper's reply is probably most on point as to the reason.

I know that recent Opterons use some of their cache to better manage cache-coherency. I presume recent Xeons do so, too, but perhaps yours are not recent enough for that?

-JimC
--
James Cloos <cloos@jhcloos.com>  OpenPGP: 1024D/ED7DAEA6
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> I don't know why you were hitting the knee sooner than I've seen
> in my benchmarks

If you're compiling your own executable, you might try boosting LOG2_NUM_LOCK_PARTITIONS (defined in lwlock.h) to 5 or 6. The current value of 4 means that there are 16 partitions to spread contention for the lightweight locks which protect the heavyweight locking, and this corresponds to your best throughput point. It might be instructive to see what happens when you tweak the number of partitions.

Also, if you can profile PostgreSQL at the sweet spot and again at a pessimal load, comparing the profiles should give good clues about the points of contention.

-Kevin
On Mon, Apr 11, 2011 at 6:04 AM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
> The new server uses 4 x 8 core Xeon X7550 CPUs at 2Ghz, our current servers are 2 x 4 core Xeon E5320 CPUs at 2Ghz.
>
> What I'm seeing is when the number of clients is greater than the number of cores, the new servers perform better on fewer cores.

The X7550s have "Turbo Boost", which means they will overclock from 2.0 GHz up to 2.4 GHz when not all cores on a die are in use. I don't know if it's possible to monitor this, but I think you can disable "Turbo Boost" in the BIOS for further testing. The E5320 CPUs in your old servers don't appear to have Turbo Boost.

-Dave
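On the monitoring question: per-core clock speed is exposed by Linux, so Turbo Boost kicking in at low client counts would show up as cores reporting above the nominal 2000 MHz. A rough sketch (frequency reporting varies by kernel and virtualization, hence the fallback):

```shell
# Show current per-core clock speeds; cores reporting noticeably above the
# nominal 2000 MHz would indicate Turbo Boost is active.
grep -i "mhz" /proc/cpuinfo || echo "no per-core frequency info exposed"
```

Sampling this during the low-concurrency runs versus the 80-client runs would show whether the "fewer cores are faster" effect is partly just higher clocks.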
> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org On Behalf Of Scott Marlowe
> Sent: Monday, April 11, 2011 1:29 PM
> Subject: Re: [PERFORM] Linux: more cores = less concurrency.
>
> Just FYI, in synthetic pgbench type benchmarks, a 48 core AMD Magny
> Cours with LSI HW RAID and 34 15k6 hard drives scales almost linearly
> up to 48 or so threads, getting into the 7000+ tps range. With SW
> RAID it gets into the 5500 tps range.

Just wondering, which LSI card? Was this 32 drives in RAID 1+0 with a two drive RAID 1 for logs, or some other config?

-M
On Mon, Apr 11, 2011 at 6:05 PM, mark <dvlhntr@gmail.com> wrote:
> Just wondering, which LSI card?
> Was this 32 drives in RAID 1+0 with a two drive RAID 1 for logs, or some
> other config?

We were using the LSI8888, but I'll be switching back to Areca when we go back to HW RAID. The LSI8888 only performed well if we set up 15 RAID-1 pairs in HW and used linux SW RAID 0 on top. RAID1+0 on the LSI8888 was a pretty mediocre performer. The Areca 1680 OTOH beats it in every test, with HW RAID10 only. Much simpler to admin.
On Mon, Apr 11, 2011 at 6:18 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> We were using the LSI8888, but I'll be switching back to Areca when we
> go back to HW RAID.

And it was RAID-10 with 4 drives for pg_xlog and RAID-10 with 24 drives for the data store. That was with both controllers, and with pure SW RAID after the LSI8888s cooked inside the poorly cooled Supermicro 1U we had them in.
> -----Original Message-----
> From: Scott Marlowe [mailto:scott.marlowe@gmail.com]
> Sent: Monday, April 11, 2011 6:18 PM
> Subject: Re: [PERFORM] Linux: more cores = less concurrency.
>
> We were using the LSI8888, but I'll be switching back to Areca when we
> go back to HW RAID. The LSI8888 only performed well if we set up 15
> RAID-1 pairs in HW and used linux SW RAID 0 on top. RAID1+0 on the
> LSI8888 was a pretty mediocre performer. The Areca 1680 OTOH beats it
> in every test, with HW RAID10 only. Much simpler to admin.

Interesting, thanks for sharing.

I guess I have never gotten to the point where I felt I needed more than 2 drives for my xlogs. Maybe I have been too quick to dismiss that as a possibility. (My biggest array is only 24 SFF drives though.)

I am trying to get my hands on a dual core LSI card for testing at work (either a 9265-8i or 9285-8e). I don't see any dual core 6Gbps SAS Areca cards yet... still rocking an Areca 1130 at home though.

-M
On Mon, Apr 11, 2011 at 5:06 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> With every benchmark I've done of PostgreSQL, the "knee" in the
> performance graph comes right around ((2 * cores) +
> effective_spindle_count). With the database fully cached (as I
> believe you mentioned), effective_spindle_count is zero. If you
> don't use a connection pool to limit active transactions to the
> number from that formula, performance drops off. The more CPUs you
> have, the sharper the drop after the knee.

I was about to say something similar, with some canned advice to use a connection pooler to control this. However, the OP's scaling is more or less topping out at cores / 4... yikes! Here are my suspicions, in rough order:

1. There is a scaling problem in the client/network/etc. Trivially disproved: convert the test to pgbench -f and post results.
2. The test is in fact i/o bound. Scaling is going to be hardware/kernel determined. Can we see iostat/vmstat/top snipped during a test run? Maybe no-op is burning you?
3. Locking/concurrency issue in heavy_seat_function() (source for that?) How much writing does it do?

Can we see some iobound and cpubound pgbench runs on both servers?

merlin
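For point 1, converting the test to a pgbench custom script is straightforward. A sketch, where `seat_test.sql` and the placeholder SELECT stand in for the real seat-allocation query (they're illustrative assumptions, not the OP's actual code):

```shell
# Write the query under test into a pgbench custom-script file.
cat > seat_test.sql <<'SQL'
-- placeholder for the SELECT portion of the seat-allocation function
SELECT 1;
SQL

# Then run it against the server under test (not executed here; database
# name, client count and duration are examples matching the 80-client case):
echo "pgbench -n -f seat_test.sql -c 80 -j 8 -T 60 yourdb"
```

Because pgbench's client loop is much lighter than a .net/npgsql harness, comparing its numbers against the custom benchmark separates client-side overhead from server-side scaling.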
On Mon, Apr 11, 2011 at 6:50 PM, mark <dvlhntr@gmail.com> wrote:
> I am trying to get my hands on a dual core LSI card for testing at work
> (either a 9265-8i or 9285-8e). I don't see any dual core 6Gbps SAS Areca
> cards yet... still rocking an Areca 1130 at home though.

Make doubly sure whatever machine you're putting it in moves plenty of air across its PCI cards. They make plenty of heat. The Areca 1880s are the 6 Gbps cards; I don't know if they're single or dual core.

The LSI interface and command line tools are so horribly designed, and the performance was so substandard, that I've pretty much given up on them. Maybe the newer cards are better, but the 9xxx series wouldn't get along with my motherboard, so it was the 8888 or Areca.

As for pg_xlog: with only two drives in a RAID-1 against 24 drives in the RAID-10 data store, we were hitting a limit under our mixed load, hence the 4 drives in RAID-10.

And we use an old 12xx series Areca at work for our primary file server, and it's been super reliable for the two years it's been running.
On 2011-04-11 22:39, James Cloos wrote:
> I know that recent Opterons use some of their cache to better manage
> cache-coherency. I presume recent Xeons do so, too, but perhaps yours
> are not recent enough for that?

Better cache-coherency also helps, but it does nothing about the fact that remote DRAM fetches are way more expensive than local ones (exact numbers are hard to get hold of nowadays).

-- Jesper
On Mon, Apr 11, 2011 at 7:04 AM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
> What I'm seeing is when the number of clients is greater than the number of cores, the new servers perform better on fewer cores.

O man, I completely forgot the issue I ran into on my machines, and that was that zone_reclaim completely screwed postgresql and file system performance. On machines with more CPU nodes and higher internode cost it gets turned on automagically, and it destroys performance for machines that use a lot of kernel cache / shared memory.

Be sure and use sysctl.conf to turn it off:

vm.zone_reclaim_mode = 0
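For reference, checking and applying that setting looks like this (a config-fragment sketch; the runtime commands need root, and the /proc entry only exists on NUMA-aware kernels):

```shell
# Inspect the current setting (non-zero means the kernel prefers reclaiming
# local pages over allocating from a remote node, which starves the page
# cache and shared_buffers):
cat /proc/sys/vm/zone_reclaim_mode

# Disable at runtime, then persist across reboots:
sysctl -w vm.zone_reclaim_mode=0
echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf
```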
On 11-4-2011 22:04 david@lang.hm wrote:
> in your case, try your new servers without hyperthreading. you will end
> up with a 4x4 core system, which should handily outperform the 2x4 core
> system you are replacing.
>
> the limit isn't 8 cores, it's that the hyperthreaded cores don't work
> well with the postgres access patterns.

It would be really weird if disabling HT turned these 8-core cpu's into 4-core cpu's ;) They have 8 physical cores and 16 threads each. So he basically has a 32-core machine with 64 threads in total (if HT were enabled).

Still, HT may or may not improve things. Back when we had time to benchmark new systems, we had one of the first HT-Xeons (a dual 5080, with two cores + HT each) available:
http://ic.tweakimg.net/ext/i/1155958729.png

The blue lines are all slightly above the orange/red lines, so back then HT slightly improved our read-mostly PostgreSQL benchmark score.

We also did benchmarks with Sun's UltraSparc T2 back then:
http://ic.tweakimg.net/ext/i/1214930814.png

Adding full cores (including threads) made things much better, but we also tested full cores with more threads each:
http://ic.tweakimg.net/ext/i/1214930816.png

As you can see, with that benchmark it was better to have 4 cores with 8 threads each than 8 cores with 2 threads each. The T2 threads were much heavier duty than the HT threads back then, but afaik Intel has improved its technology quite a bit with this re-introduction.

So I wouldn't dismiss hyperthreading for a read-mostly PostgreSQL workload too easily. Then again, keeping 32 cores busy without them contending for every resource will already be quite hard, so adding 32 additional "threads" may indeed make matters much worse.

Best regards,

Arjen
--- On Tue, 12/4/11, Merlin Moncure <mmoncure@gmail.com> wrote:

> 1. There is a scaling problem in the client/network/etc. Trivially
> disproved: convert the test to pgbench -f and post results.
> 2. The test is in fact i/o bound. Scaling is going to be
> hardware/kernel determined. Can we see iostat/vmstat/top snipped
> during a test run? Maybe no-op is burning you?

This is during my 80 clients test; this is a point at which the performance is well below that of the same machine limited to 8 cores.

http://www.privatepaste.com/dc131ff26e

> 3. Locking/concurrency issue in heavy_seat_function() (source for
> that?) How much writing does it do?

No writing afaik - it's a select with a few joins and subqueries - and I'm pretty sure it's not writing out temp data either. But all clients are after the same data in the test - maybe there are some locks there?

> Can we see some iobound and cpubound pgbench runs on both servers?

Of course, I'll post when I've gotten to that.
--- On Tue, 12/4/11, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> Be sure and use sysctl.conf to turn it off:
>
> vm.zone_reclaim_mode = 0

I've made this change; I've not seen any immediate difference, however it's good to know. Thanks Scott.
--- On Mon, 11/4/11, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > From: Kevin Grittner <Kevin.Grittner@wicourts.gov> > Subject: Re: [PERFORM] Linux: more cores = less concurrency. > To: david@lang.hm, "Steve Clark" <sclark@netwolves.com>, "Kevin Grittner" <Kevin.Grittner@wicourts.gov>, "Glyn Astill"<glynastill@yahoo.co.uk> > Cc: "Joshua D. Drake" <jd@commandprompt.com>, "Scott Marlowe" <scott.marlowe@gmail.com>, pgsql-performance@postgresql.org > Date: Monday, 11 April, 2011, 22:35 > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> > wrote: > > > I don't know why you were hitting the knee sooner than > I've seen > > in my benchmarks > > If you're compiling your own executable, you might try > boosting > LOG2_NUM_LOCK_PARTITIONS (defined in lwlocks.h) to 5 or > 6. The > current value of 4 means that there are 16 partitions to > spread > contention for the lightweight locks which protect the > heavyweight > locking, and this corresponds to your best throughput > point. It > might be instructive to see what happens when you tweak the > number > of partitions. > Tried tweeking LOG2_NUM_LOCK_PARTITIONS between 5 and 7. My results took a dive when I changed to 32 partitions, and improvedas I increaced to 128, but appeared to be happiest at the default of 16. > Also, if you can profile PostgreSQL at the sweet spot and > again at a > pessimal load, comparing the profiles should give good > clues about > the points of contention. > Results for the same machine on 8 and 32 cores are here: http://www.8kb.co.uk/server_benchmarks/dblt_results.csv Here's the sweet spot for 32 cores, and the 8 core equivalent: http://www.8kb.co.uk/server_benchmarks/iostat-32cores_32Clients.txt http://www.8kb.co.uk/server_benchmarks/vmstat-32cores_32Clients.txt http://www.8kb.co.uk/server_benchmarks/iostat-8cores_32Clients.txt http://www.8kb.co.uk/server_benchmarks/vmstat-8cores_32Clients.txt ... 
and at the pessimal load for 32 cores, and the 8 core equivalent:

http://www.8kb.co.uk/server_benchmarks/iostat-32cores_100Clients.txt
http://www.8kb.co.uk/server_benchmarks/vmstat-32cores_100Clients.txt
http://www.8kb.co.uk/server_benchmarks/iostat-8cores_100Clients.txt
http://www.8kb.co.uk/server_benchmarks/vmstat-8cores_100Clients.txt

vmstat shows double the context switches on 32 cores; could this be a factor? Is there anything else I'm missing there?

Cheers
Glyn
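For anyone wanting to repeat the lock-partition experiment Kevin describes: the count is a compile-time constant, so changing it means editing the header and rebuilding PostgreSQL. A sketch against a scratch copy of the relevant line (in the source tree of this era it lives in src/include/storage/lwlock.h; the exact whitespace in the real header may differ):

```shell
# Scratch copy standing in for src/include/storage/lwlock.h.
hdr=$(mktemp)
echo '#define LOG2_NUM_LOCK_PARTITIONS  4' > "$hdr"

# Bump 4 -> 6 (i.e. 16 -> 64 lock partitions), then rebuild postgres.
sed -i 's/\(LOG2_NUM_LOCK_PARTITIONS\)  4/\1  6/' "$hdr"
grep LOG2_NUM_LOCK_PARTITIONS "$hdr"
```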
On Tue, Apr 12, 2011 at 3:54 AM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
> --- On Tue, 12/4/11, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> >> The issue I'm seeing is that 8 real cores outperform 16 real cores, which outperform 32 real cores under high concurrency.
>> >
>> > With every benchmark I've done of PostgreSQL, the "knee" in the performance graph comes right around ((2 * cores) + effective_spindle_count). With the database fully cached (as I believe you mentioned), effective_spindle_count is zero. If you don't use a connection pool to limit active transactions to the number from that formula, performance drops off. The more CPUs you have, the sharper the drop after the knee.
>>
>> I was about to say something similar with some canned advice to use a connection pooler to control this. However, OP scaling is more or less topping out at cores / 4... yikes! Here are my suspicions in rough order:
>>
>> 1. There is a scaling problem in client/network/etc. Trivially disproved: convert the test to pgbench -f and post results.
>> 2. The test is in fact i/o bound. Scaling is going to be hardware/kernel determined. Can we see iostat/vmstat/top snipped during a test run? Maybe no-op is burning you?
>
> This is during my 80 clients test; this is a point at which the performance is well below that of the same machine limited to 8 cores.
>
> http://www.privatepaste.com/dc131ff26e
>
>> 3. Locking/concurrency issue in heavy_seat_function() (source for that?) How much writing does it do?
>
> No writing afaik - it's a select with a few joins and subqueries. I'm pretty sure it's not writing out temp data either, but all clients are after the same data in the test - maybe there's some locks there?
>
>> Can we see some iobound and cpubound pgbench runs on both servers?
>
> Of course, I'll post when I've gotten to that.
Ok, there's no writing going on - so the i/o tests aren't necessary. Context switches are also not too high - the problem is likely in postgres or on your end.

However, I would still like to see pgbench select only tests:

pgbench -i -s 1
pgbench -S -c 8 -t 500
pgbench -S -c 32 -t 500
pgbench -S -c 80 -t 500

pgbench -i -s 500
pgbench -S -c 8 -t 500
pgbench -S -c 32 -t 500
pgbench -S -c 80 -t 500

write out bench.sql with:

begin;
select * from heavy_seat_function();
select * from heavy_seat_function();
commit;

pgbench -n -f bench.sql -c 8 -t 500
pgbench -n -f bench.sql -c 8 -t 500
pgbench -n -f bench.sql -c 8 -t 500

I'm still suspecting an obvious problem here. One thing we may have overlooked is that you are connecting and disconnecting once per benchmarking step (two query executions). If you have heavy RSA encryption enabled on connection establishment, this could eat you. If pgbench results confirm your scaling problems and our issue is not in the general area of connection establishment, it's time to break out the profiler :/.

merlin
On Tue, Apr 12, 2011 at 8:23 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> write out bench.sql with:
> begin;
> select * from heavy_seat_function();
> select * from heavy_seat_function();
> commit;
>
> pgbench -n -f bench.sql -c 8 -t 500
> pgbench -n -f bench.sql -c 8 -t 500
> pgbench -n -f bench.sql -c 8 -t 500

whoops:

pgbench -n -f bench.sql -c 8 -t 500
pgbench -n -f bench.sql -c 32 -t 500
pgbench -n -f bench.sql -c 80 -t 500

merlin
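Merlin's corrected sequence, collected into one script for convenience (a sketch: the database name is a placeholder, and the pgbench runs are guarded so the script only attempts them where pgbench and a reachable server exist; note that pgbench reads a custom transaction file via -f):

```shell
DB=${PGDATABASE:-bench}   # placeholder database name

# The custom transaction Merlin asked for.
cat > bench.sql <<'EOF'
begin;
select * from heavy_seat_function();
select * from heavy_seat_function();
commit;
EOF

# Only attempt the runs when pgbench is actually available.
if command -v pgbench >/dev/null 2>&1; then
    for scale in 1 500; do
        pgbench -i -s "$scale" "$DB"          # initialize at this scale
        for c in 8 32 80; do
            pgbench -S -c "$c" -t 500 "$DB"   # built-in select-only test
        done
    done
    for c in 8 32 80; do
        pgbench -n -f bench.sql -c "$c" -t 500 "$DB"   # custom transaction
    done
fi
```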
Glyn Astill <glynastill@yahoo.co.uk> wrote:

> Tried tweaking LOG2_NUM_LOCK_PARTITIONS between 5 and 7. My results took a dive when I changed to 32 partitions, and improved as I increased to 128, but appeared to be happiest at the default of 16.

Good to know.

>> Also, if you can profile PostgreSQL at the sweet spot and again at a pessimal load, comparing the profiles should give good clues about the points of contention.

> [iostat and vmstat output]

Wow, zero idle and zero wait, and single digit for system. Did you ever run those RAM speed tests? (I don't remember seeing results for that - or failed to recognize them.) My best guess at this point is that you don't have the bandwidth to RAM to support the CPU power. Databases tend to push data around in RAM a lot.

When I mentioned profiling, I was thinking more of oprofile or something like it. If it were me, I'd be going there by now.

-Kevin
--- On Tue, 12/4/11, Merlin Moncure <mmoncure@gmail.com> wrote:

> whoops:
> pgbench -n -f bench.sql -c 8 -t 500
> pgbench -n -f bench.sql -c 32 -t 500
> pgbench -n -f bench.sql -c 80 -t 500
>
> merlin

Right, here they are:

http://www.privatepaste.com/3dd777f4db
--- On Tue, 12/4/11, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

> Wow, zero idle and zero wait, and single digit for system. Did you ever run those RAM speed tests? (I don't remember seeing results for that - or failed to recognize them.) My best guess at this point is that you don't have the bandwidth to RAM to support the CPU power. Databases tend to push data around in RAM a lot.

I mentioned sysbench was giving me something like 3000 MB/sec on memory write tests, but nothing more. Results from Greg Smith's stream_scaling test are here:

http://www.privatepaste.com/4338aa1196

> When I mentioned profiling, I was thinking more of oprofile or something like it. If it were me, I'd be going there by now.

Advice taken, it'll be my next step.

Glyn
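For reference, a minimal oprofile session for this kind of comparison might look like the following. This is a sketch using the legacy opcontrol interface current at the time; it needs root, the workload step is a placeholder, and the whole thing is guarded so it degrades gracefully where oprofile isn't installed.

```shell
# Profile the postgres binary under load, once at the sweet spot and
# once at the pessimal load, then diff the two opreport outputs.
if command -v opcontrol >/dev/null 2>&1; then
    have_oprofile=1
    opcontrol --init
    opcontrol --no-vmlinux --start
    # ... run the benchmark at the sweet spot, then repeat this whole
    #     session at the pessimal load ...
    opcontrol --dump
    opreport --symbols "$(command -v postgres)" | head -30
    opcontrol --shutdown
else
    have_oprofile=0
    echo "oprofile not installed; skipping"
fi
```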
On Tue, Apr 12, 2011 at 11:01 AM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
> Right, here they are:
>
> http://www.privatepaste.com/3dd777f4db

Your results unfortunately confirmed the worst - no easy answers on this one :(. Before breaking out the profiler, can you take some random samples of:

select count(*) from pg_stat_activity where waiting;

to see if you have any locking issues? Also, are you sure your function executions are relatively free of side effects? I can take a look at the code off list if you'd prefer to keep it discrete.

merlin
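A quick way to take those random samples is a small psql loop like the one below. A sketch: the database name and sample count are arbitrary, the query uses the boolean `waiting` column pg_stat_activity had in this era, and psql is only invoked if it is present and can connect.

```shell
# Sample the number of backends waiting on a lock, once per second.
DB=${PGDATABASE:-postgres}
samples=0
for i in 1 2 3 4 5; do
    if command -v psql >/dev/null 2>&1; then
        psql -At -d "$DB" \
             -c "select count(*) from pg_stat_activity where waiting;" \
             || true   # tolerate an unreachable server
    fi
    samples=$((samples+1))
    sleep 1
done
echo "took $samples samples"
```

Consistently nonzero counts during the slow runs would point at lock contention; zeros push the suspicion back toward memory or scheduling.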
Glyn Astill <glynastill@yahoo.co.uk> wrote:

> Results from Greg Smith's stream_scaling test are here:
>
> http://www.privatepaste.com/4338aa1196

Well, that pretty much clinches it. Your RAM access tops out at 16 processors. It appears that your processors are spending most of their time waiting for and contending for the RAM bus.

I have gotten machines in where moving a jumper, flipping a DIP switch, or changing BIOS options from the default made a big difference. I'd be looking at the manuals for my motherboard and BIOS right now to see what options there might be to improve that.

-Kevin
On Tue, Apr 12, 2011 at 6:40 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>
> Well, that pretty much clinches it. Your RAM access tops out at 16 processors. It appears that your processors are spending most of their time waiting for and contending for the RAM bus.

It tops out, but it doesn't drop. I'd propose that the perceived drop in TPS is due to cache contention - i.e., more processes fighting for the scarce cache means less efficient use of the bandwidth, which is constant upwards of 16 processes.

So... the solution would be to add more servers, rather than just sockets. (Or a server with more sockets *and* more bandwidth.)
Hi,

I think that a NUMA architecture machine can solve the problem...

A +

On 11/04/2011 15:04, Glyn Astill wrote:
> The new server uses 4 x 8 core Xeon X7550 CPUs at 2Ghz, our current servers are 2 x 4 core Xeon E5320 CPUs at 2Ghz.
>
> What I'm seeing is when the number of clients is greater than the number of cores, the new servers perform better on fewer cores.

--
Frédéric BROUARD - expert SGBDR et SQL - MVP SQL Server - 06 11 86 40 66
Le site sur le langage SQL et les SGBDR : http://sqlpro.developpez.com
Enseignant Arts & Métiers PACA, ISEN Toulon et CESI/EXIA Aix en Provence
Audit, conseil, expertise, formation, modélisation, tuning, optimisation
*********************** http://www.sqlspot.com *************************
Kevin Grittner wrote:
> Glyn Astill <glynastill@yahoo.co.uk> wrote:
>
>> Results from Greg Smith's stream_scaling test are here:
>>
>> http://www.privatepaste.com/4338aa1196
>
> Well, that pretty much clinches it. Your RAM access tops out at 16 processors. It appears that your processors are spending most of their time waiting for and contending for the RAM bus.

I've pulled Glyn's results into https://github.com/gregs1104/stream-scaling so they're easy to compare against similar processors; his system is the one labeled 4 X X7550. I'm hearing this same story from multiple people lately: these 32+ core servers bottleneck on aggregate memory speed when running PostgreSQL long before the CPUs are fully utilized. This server is close to maximum memory utilization at 8 cores, and the small increase in gross throughput above that doesn't seem to be making up for the loss in L1 and L2 thrashing from trying to run more. These systems with many cores can only be used fully if you have a program that can work efficiently some of the time with just local CPU resources. That's very rarely the case for a database that's moving 8K pages, tuple caches, and other forms of working memory around all the time.

> I have gotten machines in where moving a jumper, flipping a DIP switch, or changing BIOS options from the default made a big difference. I'd be looking at the manuals for my motherboard and BIOS right now to see what options there might be to improve that.

I already forwarded Glyn a good article about tuning these Dell BIOSs in particular from an interesting blog series others here might like too:

http://bleything.net/articles/postgresql-benchmarking-memory.html

Ben Bleything is doing a very thorough walk-through of server hardware validation, and as is often the case he's already found one major problem with the vendor config he had to fix to get expected results.
-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Scott Marlowe wrote:
> Have you tried running the memory stream benchmark Greg Smith had posted here a while back? It'll let you know if your memory is bottlenecking. Right now my 48 core machines are the king of that benchmark with something like 70+ gig a second.

The big Opterons are still the front-runners here, but not with 70GB/s anymore. Earlier versions of stream-scaling didn't use nearly enough data to avoid L3 cache in the processors interfering with results. More recent tests I've gotten in, done after I expanded the default test size, show the Opterons normally hitting the same ~35GB/s maximum throughput that the Intel processors get out of similar DDR3/1333 sets.

There are some outliers where >50GB/s still shows up. I'm not sure if I really believe them, though; attempts to increase the test size now hit a 32-bit limit inside stream.c, and I think that's not really big enough to avoid L3 cache effects here.

In the table at https://github.com/gregs1104/stream-scaling the 4 X 6172 server is similar to Scott's system. I believe the results for 8 (37613) and 48 cores (32301) there. I remain somewhat suspicious that the higher results of 40 - 51GB/s shown between 16 and 32 cores may be inflated by caching. At this point I'll probably need direct access to one of them to resolve this for sure. I've made a lot of progress with other people's servers, but complete trust in those particular results still isn't there yet.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Tue, Apr 12, 2011 at 12:00 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> I've pulled Glyn's results into https://github.com/gregs1104/stream-scaling so they're easy to compare against similar processors, his system is the one labled 4 X X7550. I'm hearing this same story from multiple people lately: these 32+ core servers bottleneck on aggregate memory speed with running PostgreSQL long before the CPUs are fully utilized.

For posterity, since it looks like you guys have nailed this one, I took a look at some of the code off list and I can confirm there is no obvious bottleneck coming from locking type issues. The functions are 'stable' as implemented with no fancy tricks.

merlin
When purchasing the Intel 7500 series, please make sure to check the hemisphere mode of your memory configuration. There is a HUGE difference in memory performance - around 50% of the speed - if you don't populate all the memory slots on the controllers properly.

https://globalsp.ts.fujitsu.com/dmsp/docs/wp-nehalem-ex-memory-performance-ww-en.pdf

- John

-----Original Message-----
From: Merlin Moncure
Sent: Tuesday, April 12, 2011 12:14 PM
Subject: Re: [PERFORM] Linux: more cores = less concurrency.

> For posterity, since it looks like you guys have nailed this one, I took a look at some of the code off list and I can confirm there is no obvious bottleneck coming from locking type issues. The functions are 'stable' as implemented with no fancy tricks.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
--- On Tue, 12/4/11, Greg Smith <greg@2ndquadrant.com> wrote:

> I already forwarded Glyn a good article about tuning these Dell BIOSs in particular from an interesting blog series others here might like too:
>
> http://bleything.net/articles/postgresql-benchmarking-memory.html
>
> Ben Bleything is doing a very thorough walk-through of server hardware validation, and as is often the case he's already found one major problem with the vendor config he had to fix to get expected results.

Thanks Greg. I've been through that post, but unfortunately there are no settings that make a difference.

However, upon further investigation and looking at the manual for the R910 here

http://support.dell.com/support/edocs/systems/per910/en/HOM/HTML/install.htm#wp1266264

I've discovered we only have 4 of the 8 memory risers, and the manual states that in this configuration we are running in "Power Optimized" mode, rather than "Performance Optimized".

We've got two of these machines, so I've just pulled all the risers from one system, removed half the memory as indicated by that document from Dell above, and now I'm seeing almost double the throughput.
If postgres is memory bandwidth constrained, what can be done to reduce its bandwidth use?

Huge Pages could help some, by reducing page table lookups and making overall access more efficient.

Compressed pages (speedy / lzo) in memory can help trade CPU cycles for memory usage for certain memory segments/pages -- this could potentially save a lot of I/O too if more pages fit in RAM as a result, and also make caches more effective.

As I've noted before, the optimizer inappropriately chooses the larger side of a join to hash instead of the smaller one in many cases on hash joins, which is less cache efficient.

Dual-pivot quicksort is more cache friendly than Postgres' single pivot one and uses less memory bandwidth on average (fewer swaps, but the same number of compares).

On 4/13/11 2:48 AM, "Glyn Astill" <glynastill@yahoo.co.uk> wrote:
> I've discovered we only have 4 of the 8 memory risers, and the manual states that in this configuration we are running in "Power Optimized" mode, rather than "Performance Optimized".
>
> We've got two of these machines, so I've just pulled all the risers from one system, removed half the memory as indicated by that document from Dell above, and now I'm seeing almost double the throughput.
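To put a number on the Huge Pages idea: on x86-64 with 2MB huge pages, sizing vm.nr_hugepages to cover shared_buffers comes out roughly as below. A sketch only: the shared_buffers figure is made up, and PostgreSQL of this vintage has no huge_pages setting, so actually using the pages takes hugetlbfs arrangements outside this snippet.

```shell
shared_buffers_mb=8192          # hypothetical shared_buffers setting
hugepage_kb=2048                # typical x86-64 huge page size (2MB)

# Round up to whole huge pages to cover the shared memory segment.
pages=$(( (shared_buffers_mb * 1024 + hugepage_kb - 1) / hugepage_kb ))
echo "vm.nr_hugepages = $pages"
```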
Scott Carey wrote:
> If postgres is memory bandwidth constrained, what can be done to reduce its bandwidth use?
>
> Huge Pages could help some, by reducing page table lookups and making overall access more efficient. Compressed pages (speedy / lzo) in memory can help trade CPU cycles for memory usage for certain memory segments/pages -- this could potentially save a lot of I/O too if more pages fit in RAM as a result, and also make caches more effective.

The problem with a lot of these ideas is that they trade the memory problem for increased disruption to the CPU L1 and L2 caches. I don't know how much that moves the bottleneck forward. And not every workload is memory constrained, either, so those that aren't might suffer from the same optimizations that help in this situation.

I just posted my slides from my MySQL conference talk today at http://projects.2ndquadrant.com/talks , and those include some graphs of recent data collected with stream-scaling. The current situation is really strange in both Intel and AMD's memory architectures. I'm even seeing situations where lightly loaded big servers are actually outperformed by small ones running the same workload. The 32 and 48 core systems using server-class DDR3/1333 just don't have the bandwidth to a single core that, say, an i7 desktop using triple-channel DDR3-1600 does. The trade-offs here are extremely hardware and workload dependent, and it's very easy to tune for one combination while slowing another.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
* Jesper Krogh:

> If you have a 1 socket system, all of your data can be fetched from "local" ram seen from your cpu; on a 2 socket, 50% of your accesses will be "way slower", 4 socket even worse.

There are non-NUMA multi-socket systems, so this doesn't apply in all cases. (The E5320-based system is likely non-NUMA.)

Speaking about NUMA, do you know if there are some non-invasive tools which can be used to monitor page migration and off-node memory accesses?

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
2011/4/14 Florian Weimer <fweimer@bfk.de>:
> * Jesper Krogh:
>
>> If you have a 1 socket system, all of your data can be fetched from "local" ram seen from your cpu; on a 2 socket, 50% of your accesses will be "way slower", 4 socket even worse.
>
> There are non-NUMA multi-socket systems, so this doesn't apply in all cases. (The E5320-based system is likely non-NUMA.)
>
> Speaking about NUMA, do you know if there are some non-invasive tools which can be used to monitor page migration and off-node memory accesses?

I am unsure it is exactly what you are looking for, but Linux does provide access to counters in:

/sys/devices/system/node/node*/numastat

I also find it useful to check meminfo per node instead of via /proc.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
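Reading those counters is just a matter of walking the sysfs files Cédric mentions; numa_miss and numa_foreign are the off-node indicators. A small guarded sketch (non-NUMA machines simply have no such files and print nothing):

```shell
# Dump per-node NUMA allocation counters where the kernel exposes them.
nodes=0
for f in /sys/devices/system/node/node*/numastat; do
    [ -r "$f" ] || continue
    nodes=$((nodes+1))
    echo "== $f =="
    cat "$f"        # numa_hit, numa_miss, numa_foreign, local_node, other_node
done
echo "inspected $nodes node(s)"
```

Sampling these before and after a benchmark run gives a rough measure of how much memory traffic is going off-node.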
On 4/13/11 9:23 PM, "Greg Smith" <greg@2ndquadrant.com> wrote:

> Scott Carey wrote:
>> If postgres is memory bandwidth constrained, what can be done to reduce
>> its bandwidth use?
>>
>> Huge Pages could help some, by reducing page table lookups and making
>> overall access more efficient.
>> Compressed pages (snappy / lzo) in memory can help trade CPU cycles for
>> memory usage for certain memory segments/pages -- this could potentially
>> save a lot of I/O too if more pages fit in RAM as a result, and also make
>> caches more effective.
>
> The problem with a lot of these ideas is that they trade the memory
> problem for increased disruption to the CPU L1 and L2 caches. I don't
> know how much that moves the bottleneck forward. And not every workload
> is memory constrained, either, so those that aren't might suffer from
> the same optimizations that help in this situation.

Compression has this problem, but I'm not sure where the plural "a lot of these ideas" comes from. Huge Pages helps caches. Dual-pivot quicksort is more cache friendly and is _always_ equal to or faster than traditional quicksort (it's a provably improved algorithm). Smaller hash tables help caches.

> I just posted my slides from my MySQL conference talk today at
> http://projects.2ndquadrant.com/talks , and those include some graphs of
> recent data collected with stream-scaling. The current situation is
> really strange in both Intel and AMD's memory architectures. I'm even
> seeing situations where lightly loaded big servers are actually
> outperformed by small ones running the same workload. The 32 and 48
> core systems using server-class DDR3/1333 just don't have the bandwidth
> to a single core that, say, an i7 desktop using triple-channel DDR3-1600
> does. The trade-offs here are extremely hardware and workload
> dependent, and it's very easy to tune for one combination while slowing
> another.
On Thu, Apr 14, 2011 at 10:05 PM, Scott Carey <scott@richrelevance.com> wrote:
> Huge Pages helps caches.
> Dual-pivot quicksort is more cache friendly and is _always_ equal to or
> faster than traditional quicksort (it's a provably improved algorithm).

If you want a cache-friendly sorting algorithm, you need mergesort. I don't know any algorithm as friendly to caches as mergesort.

Quicksort could be better only when the sorting buffer is guaranteed to fit in the CPU's cache, and that's usually just a few 4kB pages.
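A toy illustration of why mergesort gets called cache friendly: the bottom-up variant below touches memory in long sequential runs, which hardware prefetchers handle well. This is an illustrative sketch in Python, not anyone's production sort, and it uses the usual extra O(n) buffer:

```python
def mergesort(a):
    """Bottom-up mergesort: every pass makes purely sequential reads from
    src and sequential writes to dst, doubling the run width each time."""
    n = len(a)
    src, dst = list(a), [None] * n
    width = 1
    while width < n:
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            i, j, k = lo, mid, lo
            while i < mid and j < hi:        # merge two adjacent runs
                if src[i] <= src[j]:
                    dst[k] = src[i]; i += 1
                else:
                    dst[k] = src[j]; j += 1
                k += 1
            dst[k:hi] = src[i:mid] if i < mid else src[j:hi]
        src, dst = dst, src                  # ping-pong the buffers
        width *= 2
    return src
```

The streaming access pattern is the cache-friendly part; the cost, as noted later in the thread, is that each pass reads and writes the whole array, i.e. more total memory traffic than an in-place partition.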
On 4/14/11 1:19 PM, "Claudio Freire" <klaussfreire@gmail.com> wrote:

> On Thu, Apr 14, 2011 at 10:05 PM, Scott Carey <scott@richrelevance.com> wrote:
>> Huge Pages helps caches.
>> Dual-pivot quicksort is more cache friendly and is _always_ equal to or
>> faster than traditional quicksort (it's a provably improved algorithm).
>
> If you want a cache-friendly sorting algorithm, you need mergesort.
>
> I don't know any algorithm as friendly to caches as mergesort.
>
> Quicksort could be better only when the sorting buffer is guaranteed
> to fit in the CPU's cache, and that's usually just a few 4kB pages.

Of the mergesort variants, Timsort is a recent general-purpose one favored by many, since it is sub-O(n log n) on partially sorted data.

Which works best under which circumstances depends a lot on the size of the data, the size of the elements, the cost of the compare function, whether you're sorting the data directly or sorting pointers, and other factors. Mergesort may be more cache friendly (?) but might use more memory bandwidth. I'm not sure.

I do know that dual-pivot quicksort provably causes fewer swaps (but the same number of compares) than the usual single-pivot quicksort. And swaps are a lot slower than you would expect due to the effects on processor caches. Therefore it might help with multiprocessor scalability by reducing memory/cache pressure.
On Fri, Apr 15, 2011 at 12:42 AM, Scott Carey <scott@richrelevance.com> wrote:
> I do know that dual-pivot quicksort provably causes fewer swaps (but the
> same number of compares) than the usual single-pivot quicksort. And swaps
> are a lot slower than you would expect due to the effects on processor
> caches. Therefore it might help with multiprocessor scalability by
> reducing memory/cache pressure.

I agree, and it's quite non-disruptive - i.e., a drop-in replacement for quicksort, whereas mergesort or timsort both require bigger changes and heavier profiling.
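For the curious, here is a compact sketch of the dual-pivot (Yaroslavskiy-style) partitioning being discussed - illustrative Python, not the code any database ships. Each pass picks two pivots p <= q and splits the array into three regions in one sweep:

```python
def dp_quicksort(a, lo=0, hi=None):
    """In-place dual-pivot quicksort: partition into (< p), (p..q), (> q)."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    if a[lo] > a[hi]:                      # ensure p <= q
        a[lo], a[hi] = a[hi], a[lo]
    p, q = a[lo], a[hi]
    less, great, k = lo + 1, hi - 1, lo + 1
    while k <= great:
        if a[k] < p:                       # grow the "< p" region
            a[k], a[less] = a[less], a[k]
            less += 1
        elif a[k] > q:                     # grow the "> q" region
            while a[great] > q and k < great:
                great -= 1
            a[k], a[great] = a[great], a[k]
            great -= 1
            if a[k] < p:                   # element pulled in from the right
                a[k], a[less] = a[less], a[k]
                less += 1
        k += 1
    less -= 1
    great += 1
    a[lo], a[less] = a[less], a[lo]        # move pivots to their final slots
    a[hi], a[great] = a[great], a[hi]
    dp_quicksort(a, lo, less - 1)          # recurse on the three regions
    dp_quicksort(a, less + 1, great - 1)
    dp_quicksort(a, great + 1, hi)
```

The "drop-in" claim holds here: the interface and in-place behavior match classic quicksort, only the partitioning step changes.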