Thread: Linux: more cores = less concurrency.

Linux: more cores = less concurrency.

From
Glyn Astill
Date:
Hi Guys,

I'm just doing some tests on a new server running one of our heavy select
functions (the select part of a plpgsql function to allocate seats)
concurrently.  We do use connection pooling and split out some selects to
slony slaves, but the tests here are primarily to test what an individual
server is capable of.

The new server uses 4 x 8 core Xeon X7550 CPUs at 2Ghz, our current servers are 2 x 4 core Xeon E5320 CPUs at 2Ghz.

What I'm seeing is when the number of clients is greater than the number of
cores, the new servers perform better on fewer cores.

Has anyone else seen this behaviour?  I'm guessing this is either a hardware
limitation or something to do with Linux process management / scheduling.
Any idea what to look into?

My benchmark utility is just a little .net/npgsql app that runs increasing
numbers of clients concurrently; each client runs a specified number of
iterations of any sql I specify.

I've posted some results and the test program here:

http://www.8kb.co.uk/server_benchmarks/
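For readers without the .NET harness, a rough equivalent of this client-scaling test can be sketched with pgbench's custom-script mode. The SQL file name, database name, and transaction counts below are assumptions, not taken from the posted tests:

```shell
#!/bin/sh
# Run the SELECT under test at increasing client counts and report tps.
SQL=heavy_select.sql      # file containing the SELECT under test (assumed name)
DB=mydb                   # assumed database name
for c in 4 8 16 32 64 80; do
    # -n: skip vacuum, -f: custom script, -c: clients, -j: threads, -t: txns/client
    pgbench -n -f "$SQL" -c "$c" -j "$c" -t 500 "$DB" \
        | awk -v c="$c" '/tps.*excluding/ {print c " clients: " $3 " tps"}'
done
```

Plotting tps against client count makes the knee easy to spot.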


Re: Linux: more cores = less concurrency.

From
"Kevin Grittner"
Date:
Glyn Astill <glynastill@yahoo.co.uk> wrote:

> The new server uses 4 x 8 core Xeon X7550 CPUs at 2Ghz

Which has hyperthreading.

> our current servers are 2 x 4 core Xeon E5320 CPUs at 2Ghz.

Which doesn't have hyperthreading.

PostgreSQL often performs worse with hyperthreading than without.
Have you turned HT off on your new machine?  If not, I would start
there.
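As a quick sanity check, whether HT is enabled can be read from /proc/cpuinfo; this is a sketch and assumes a Linux /proc layout:

```shell
#!/bin/sh
# If hardware threads outnumber physical cores, HT/SMT is on.
threads=$(grep -c '^processor' /proc/cpuinfo)
# cores = (cores per package) * (number of physical packages)
cores=$(awk -F: '/^cpu cores/ {c=$2} /^physical id/ {ids[$2]=1} \
                 END {n=0; for (i in ids) n++; print c*n}' /proc/cpuinfo)
echo "hardware threads: $threads, physical cores: $cores"
if [ "$threads" -gt "$cores" ]; then
    echo "hyperthreading appears to be ON"
else
    echo "hyperthreading appears to be OFF"
fi
```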

-Kevin

Re: Linux: more cores = less concurrency.

From
"Joshua D. Drake"
Date:
On Mon, 11 Apr 2011 13:09:15 -0500, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov> wrote:
> Glyn Astill <glynastill@yahoo.co.uk> wrote:
>
>> The new server uses 4 x 8 core Xeon X7550 CPUs at 2Ghz
>
> Which has hyperthreading.
>
>> our current servers are 2 x 4 core Xeon E5320 CPUs at 2Ghz.
>
> Which doesn't have hyperthreading.
>
> PostgreSQL often performs worse with hyperthreading than without.
> Have you turned HT off on your new machine?  If not, I would start
> there.

And then make sure you aren't running CFQ.

JD

>
> -Kevin

--
PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997

Re: Linux: more cores = less concurrency.

From
Glyn Astill
Date:

--- On Mon, 11/4/11, Joshua D. Drake <jd@commandprompt.com> wrote:

> On Mon, 11 Apr 2011 13:09:15 -0500,
> "Kevin Grittner"
> <Kevin.Grittner@wicourts.gov>
> wrote:
> > Glyn Astill <glynastill@yahoo.co.uk>
> wrote:
> > 
> >> The new server uses 4 x 8 core Xeon X7550 CPUs at
> 2Ghz
> > 
> > Which has hyperthreading.
> > 
> >> our current servers are 2 x 4 core Xeon E5320 CPUs
> at 2Ghz.
> > 
> > Which doesn't have hyperthreading.
> > 

Yep, off. If you look at the benchmarks I took, HT absolutely killed it.

> > PostgreSQL often performs worse with hyperthreading
> than without.
> > Have you turned HT off on your new machine?  If
> not, I would start
> > there.
>
> And then make sure you aren't running CFQ.
>
> JD
>

Not running CFQ, running the no-op i/o scheduler.
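For reference, the active scheduler can be checked and changed per block device; the device name here is an example:

```shell
#!/bin/sh
# Show and change the I/O scheduler for one block device.
DEV=sda                                         # example device; repeat per device
cat /sys/block/$DEV/queue/scheduler             # active scheduler shown in [brackets]
echo noop > /sys/block/$DEV/queue/scheduler     # as root; 'deadline' is another common choice
active=$(sed -n 's/.*\[\([a-z]*\)\].*/\1/p' /sys/block/$DEV/queue/scheduler)
echo "now using: $active"
```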

Re: Linux: more cores = less concurrency.

From
Scott Marlowe
Date:
On Mon, Apr 11, 2011 at 12:12 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On Mon, 11 Apr 2011 13:09:15 -0500, "Kevin Grittner"
> <Kevin.Grittner@wicourts.gov> wrote:
>> Glyn Astill <glynastill@yahoo.co.uk> wrote:
>>
>>> The new server uses 4 x 8 core Xeon X7550 CPUs at 2Ghz
>>
>> Which has hyperthreading.
>>
>>> our current servers are 2 x 4 core Xeon E5320 CPUs at 2Ghz.
>>
>> Which doesn't have hyperthreading.
>>
>> PostgreSQL often performs worse with hyperthreading than without.
>> Have you turned HT off on your new machine?  If not, I would start
>> there.
>
> And then make sure you aren't running CFQ.
>
> JD

This++

Also if you're running a good hardware RAID controller, just go to NOOP

Re: Linux: more cores = less concurrency.

From
Scott Marlowe
Date:
On Mon, Apr 11, 2011 at 12:23 PM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
>
>
> --- On Mon, 11/4/11, Joshua D. Drake <jd@commandprompt.com> wrote:
>
>> On Mon, 11 Apr 2011 13:09:15 -0500,
>> "Kevin Grittner"
>> <Kevin.Grittner@wicourts.gov>
>> wrote:
>> > Glyn Astill <glynastill@yahoo.co.uk>
>> wrote:
>> >
>> >> The new server uses 4 x 8 core Xeon X7550 CPUs at
>> 2Ghz
>> >
>> > Which has hyperthreading.
>> >
>> >> our current servers are 2 x 4 core Xeon E5320 CPUs
>> at 2Ghz.
>> >
>> > Which doesn't have hyperthreading.
>> >
>
> Yep, off. If you look at the benchmarks I took, HT absolutely killed it.
>
>> > PostgreSQL often performs worse with hyperthreading
>> than without.
>> > Have you turned HT off on your new machine?  If
>> not, I would start
>> > there.
>>
>> And then make sure you aren't running CFQ.
>>
>> JD
>>
>
> Not running CFQ, running the no-op i/o scheduler.

Just FYI, in synthetic pgbench type benchmarks, a 48 core AMD Magny
Cours with LSI HW RAID and 34 15k6 Hard drives scales almost linearly
up to 48 or so threads, getting into the 7000+ tps range.  With SW
RAID it gets into the 5500 tps range.

Re: Linux: more cores = less concurrency.

From
Glyn Astill
Date:
--- On Mon, 11/4/11, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> Just FYI, in synthetic pgbench type benchmarks, a 48 core
> AMD Magny
> Cours with LSI HW RAID and 34 15k6 Hard drives scales
> almost linearly
> up to 48 or so threads, getting into the 7000+ tps
> range.  With SW
> RAID it gets into the 5500 tps range.
>

I'll have to try with the synthetic benchmarks next then, but something's
definitely going off here.  I'm seeing no disk activity at all as they're
selects and all pages are in ram.

I was wondering if anyone had any deeper knowledge of any kernel tunables, or anything else for that matter.

A wild guess is something like multiple cores contending for cpu cache, cpu
affinity, or some kind of contention in the kernel; alas, a little out of my
depth.

It's pretty sickening to think I can't get anything else out of more than 8 cores.

Re: Linux: more cores = less concurrency.

From
Steve Clark
Date:
On 04/11/2011 02:32 PM, Scott Marlowe wrote:
> On Mon, Apr 11, 2011 at 12:12 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
>> On Mon, 11 Apr 2011 13:09:15 -0500, "Kevin Grittner"
>> <Kevin.Grittner@wicourts.gov> wrote:
>>> Glyn Astill <glynastill@yahoo.co.uk> wrote:
>>>
>>>> The new server uses 4 x 8 core Xeon X7550 CPUs at 2Ghz
>>> Which has hyperthreading.
>>>
>>>> our current servers are 2 x 4 core Xeon E5320 CPUs at 2Ghz.
>>> Which doesn't have hyperthreading.
>>>
>>> PostgreSQL often performs worse with hyperthreading than without.
>>> Have you turned HT off on your new machine?  If not, I would start
>>> there.

Anyone know the reason for that?

>> And then make sure you aren't running CFQ.
>>
>> JD
> This++
>
> Also if you're running a good hardware RAID controller, just go to NOOP



--
Stephen Clark
NetWolves
Sr. Software Engineer III
Phone: 813-579-3200
Fax: 813-882-0209
Email: steve.clark@netwolves.com
http://www.netwolves.com

Re: Linux: more cores = less concurrency.

From
Jesper Krogh
Date:
On 2011-04-11 21:42, Glyn Astill wrote:
>
> I'll have to try with the synthetic benchmarks next then, but something's
> definitely going off here.  I'm seeing no disk activity at all as they're
> selects and all pages are in ram.
Well, if you don't have enough computations to be bottlenecked on the
cpu, then a 4 socket system is slower than a comparable 2 socket system,
and a 1 socket system is even better.

If you have a 1 socket system, all of your data can be fetched from
"local" ram as seen from your cpu; on a 2 socket, 50% of your accesses
will be "way slower", 4 socket even worse.

So more sockets only begin to pay off when you can actually use the
CPUs, or when the extra memory keeps your database from going to disk
due to size.

--
Jesper

Re: Linux: more cores = less concurrency.

From
Glyn Astill
Date:
--- On Mon, 11/4/11, david@lang.hm <david@lang.hm> wrote:

> On Mon, 11 Apr 2011, Steve Clark
> wrote:
>
> the limit isn't 8 cores, it's that the hyperthreaded cores
> don't work well with the postgres access patterns.
>

This has nothing to do with hyperthreading. I have a hyperthreaded benchmark
purely for completeness, but can we please forget about it.

The issue I'm seeing is that 8 real cores outperform 16 real cores, which outperform 32 real cores under high
concurrency.

32 cores is much faster than 8 when I have relatively few clients, but as
the number of clients is scaled up 8 cores wins outright.

I was hoping someone had seen this sort of behaviour before, and could offer some sort of explanation or advice.

Re: Linux: more cores = less concurrency.

From
david@lang.hm
Date:
On Mon, 11 Apr 2011, Steve Clark wrote:

> On 04/11/2011 02:32 PM, Scott Marlowe wrote:
>> On Mon, Apr 11, 2011 at 12:12 PM, Joshua D. Drake<jd@commandprompt.com>
>> wrote:
>>> On Mon, 11 Apr 2011 13:09:15 -0500, "Kevin Grittner"
>>> <Kevin.Grittner@wicourts.gov>  wrote:
>>>> Glyn Astill<glynastill@yahoo.co.uk>  wrote:
>>>>
>>>>> The new server uses 4 x 8 core Xeon X7550 CPUs at 2Ghz
>>>> Which has hyperthreading.
>>>>
>>>>> our current servers are 2 x 4 core Xeon E5320 CPUs at 2Ghz.
>>>> Which doesn't have hyperthreading.
>>>>
>>>> PostgreSQL often performs worse with hyperthreading than without.
>>>> Have you turned HT off on your new machine?  If not, I would start
>>>> there.
> Anyone know the reason for that?

hyperthreads are not real cores.

they make the assumption that you aren't fully using the core (because it
is stalled waiting for memory or something like that) and context-switch
you to a different set of registers, while using the same computational
resources for your extra 'core'.

for some applications this works well, but for others it can be a very
significant performance hit. (IIRC, this ranges from +60% to -30% or so in
benchmarks.)

Intel has wonderful marketing and has managed to convince people that HT
cores are real cores, but 16 real cores will outperform 8 real cores + 8
HT 'fake' cores every time. the 16 real cores will eat more power, be more
expensive, etc so you are paying for the performance.

in your case, try your new servers without hyperthreading. you will end up
with a 4x4 core system, which should handily outperform the 2x4 core
system you are replacing.

the limit isn't 8 cores, it's that the hyperthreaded cores don't work well
with the postgres access patterns.

David Lang

Re: Linux: more cores = less concurrency.

From
Scott Marlowe
Date:
On Mon, Apr 11, 2011 at 1:42 PM, Glyn Astill <glynastill@yahoo.co.uk> wrote:

> A wild guess is something like multiple cores contending for cpu cache,
> cpu affinity, or some kind of contention in the kernel; alas, a little
> out of my depth.
>
> It's pretty sickening to think I can't get anything else out of more than 8 cores.

Have you tried running the memory stream benchmark Greg Smith had
posted here a while back?  It'll let you know if your memory is
the bottleneck.  Right now my 48 core machines are the king of that
benchmark with something like 70+ Gig a second.

Re: Linux: more cores = less concurrency.

From
"Kevin Grittner"
Date:
Glyn Astill <glynastill@yahoo.co.uk> wrote:

> The issue I'm seeing is that 8 real cores outperform 16 real
> cores, which outperform 32 real cores under high concurrency.

With every benchmark I've done of PostgreSQL, the "knee" in the
performance graph comes right around ((2 * cores) +
effective_spindle_count).  With the database fully cached (as I
believe you mentioned), effective_spindle_count is zero.  If you
don't use a connection pool to limit active transactions to the
number from that formula, performance drops off.  The more CPUs you
have, the sharper the drop after the knee.

I think it's nearly inevitable that PostgreSQL will eventually add
some sort of admission policy or scheduler so that the user doesn't
see this effect.  With an admission policy, PostgreSQL would
effectively throttle the startup of new transactions so that things
remained almost flat after the knee.  A well-designed scheduler
might even be able to sneak marginal improvements past the current
knee.  As things currently stand it is up to you to do this with a
carefully designed connection pool.

> 32 cores is much faster than 8 when I have relatively few clients,
> but as the number of clients is scaled up 8 cores wins outright.

Right.  If you were hitting disk heavily with random access, the
sweet spot would increase by the number of spindles you were
hitting.

> I was hoping someone had seen this sort of behaviour before, and
> could offer some sort of explanation or advice.

When you have multiple resources, adding active processes increases
overall throughput until roughly the point when you can keep them
all busy.  Once you hit that point, adding more processes to contend
for the resources just adds overhead and blocking.  HT is so bad
because it tends to cause context switch storms, but context
switching becomes an issue even without it.  The other main issue is
lock contention.  Beyond a certain point, processes start to contend
for lightweight locks, so you might context switch to a process only
to find that it's still blocked and you have to switch again to try
the next process, until you finally find one which can make
progress.  To acquire the lightweight lock you first need to acquire
a spinlock, so as things get busier processes start eating lots of
CPU in the spinlock loops trying to get to the point of being able
to check the LW locks to see if they're available.

You clearly got the best performance with all 32 cores and 16 to 32
processes active.  I don't know why you were hitting the knee sooner
than I've seen in my benchmarks, but the principle is the same.  Use
a connection pool which limits how many transactions are active,
such that you don't exceed 32 processes busy at the same time, and
make sure that it queues transaction requests beyond that so that a
new transaction can be started promptly when you are at your limit
and a transaction completes.
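Kevin's rule of thumb can be sketched as a quick calculation (the core count is read from /proc/cpuinfo; the numbers are illustrative, not a guarantee):

```shell
#!/bin/sh
# Rule of thumb for the throughput "knee":
#   pool_size = (2 * cores) + effective_spindle_count
# With the database fully cached, effective_spindle_count is 0.
cores=$(grep -c '^processor' /proc/cpuinfo)
spindles=0                        # fully cached workload
pool=$(( 2 * cores + spindles ))
echo "suggested limit on active transactions: $pool"
```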

-Kevin

Re: Linux: more cores = less concurrency.

From
Glyn Astill
Date:

--- On Mon, 11/4/11, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> On Mon, Apr 11, 2011 at 1:42 PM, Glyn
> Astill <glynastill@yahoo.co.uk>
> wrote:
>
> > A wild guess is something like multiple cores
> contending for cpu cache, cpu affinity, or some kind of
> contention in the kernel, alas a little out of my depth.
> >
> > It's pretty sickening to think I can't get anything
> else out of more than 8 cores.
>
> Have you tried running the memory stream benchmark Greg Smith had
> posted here a while back?  It'll let you know if your memory is
> the bottleneck.  Right now my 48 core machines are the king of that
> benchmark with something like 70+ Gig a second.
>
>

No I haven't, but I will first thing tomorrow morning.  I did run a sysbench
memory write test though; if I recall correctly that gave me somewhere just
over 3000 MB/s.



Re: Linux: more cores = less concurrency.

From
James Cloos
Date:
>>>>> "GA" == Glyn Astill <glynastill@yahoo.co.uk> writes:

GA> I was hoping someone had seen this sort of behaviour before,
GA> and could offer some sort of explanation or advice.

Jesper's reply is probably most on point as to the reason.

I know that recent Opterons use some of their cache to better manage
cache-coherency.  I presume recent Xeons do so, too, but perhaps yours
are not recent enough for that?

-JimC
--
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

Re: Linux: more cores = less concurrency.

From
"Kevin Grittner"
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> I don't know why you were hitting the knee sooner than I've seen
> in my benchmarks

If you're compiling your own executable, you might try boosting
LOG2_NUM_LOCK_PARTITIONS (defined in lwlock.h) to 5 or 6.  The
current value of 4 means that there are 16 partitions to spread
contention for the lightweight locks which protect the heavyweight
locking, and this corresponds to your best throughput point.  It
might be instructive to see what happens when you tweak the number
of partitions.
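A sketch of such a rebuild, assuming a 9.0-era source tree; the version, install prefix, and exact whitespace in the #define are assumptions:

```shell
#!/bin/sh
# Build a test PostgreSQL with LOG2_NUM_LOCK_PARTITIONS raised from 4
# (16 partitions) to 6 (64 partitions).  Header lives in
# src/include/storage/lwlock.h in this source layout.
cd postgresql-9.0.4
sed -i 's/\(#define LOG2_NUM_LOCK_PARTITIONS *\)4/\16/' \
    src/include/storage/lwlock.h
./configure --prefix=/usr/local/pgsql-lockpart-test
make && make install
```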

Also, if you can profile PostgreSQL at the sweet spot and again at a
pessimal load, comparing the profiles should give good clues about
the points of contention.

-Kevin

Re: Linux: more cores = less concurrency.

From
David Rees
Date:
On Mon, Apr 11, 2011 at 6:04 AM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
> The new server uses 4 x 8 core Xeon X7550 CPUs at 2Ghz, our current servers are 2 x 4 core Xeon E5320 CPUs at 2Ghz.
>
> What I'm seeing is when the number of clients is greater than the number of
> cores, the new servers perform better on fewer cores.

The X7550s have "Turbo Boost", which means they will overclock to 2.4
GHz from 2.0 GHz when not all cores on a die are in use.  I don't know
if it's possible to monitor this, but I think you can disable "Turbo
Boost" in the BIOS for further testing.

The E5320 CPUs in your old servers don't appear to have "Turbo Boost".
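One crude way to watch for this is to snapshot the reported clock speeds during a run; some kernels only report the base frequency here, so treat this as a hint only:

```shell
#!/bin/sh
# Per-core clock snapshot; values above the 2.0 GHz base suggest Turbo
# Boost is active on those cores.
grep '^cpu MHz' /proc/cpuinfo | awk '{print "core " NR-1 ": " $4 " MHz"}'
```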

-Dave

Re: Linux: more cores = less concurrency.

From
"mark"
Date:

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-
> owner@postgresql.org] On Behalf Of Scott Marlowe
> Sent: Monday, April 11, 2011 1:29 PM
> To: Glyn Astill
> Cc: Kevin Grittner; Joshua D. Drake; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Linux: more cores = less concurrency.
>
> On Mon, Apr 11, 2011 at 12:23 PM, Glyn Astill <glynastill@yahoo.co.uk>
> wrote:
> >
> >
> > --- On Mon, 11/4/11, Joshua D. Drake <jd@commandprompt.com> wrote:
> >
> >> On Mon, 11 Apr 2011 13:09:15 -0500,
> >> "Kevin Grittner"
> >> <Kevin.Grittner@wicourts.gov>
> >> wrote:
> >> > Glyn Astill <glynastill@yahoo.co.uk>
> >> wrote:
> >> >
> >> >> The new server uses 4 x 8 core Xeon X7550 CPUs at
> >> 2Ghz
> >> >
> >> > Which has hyperthreading.
> >> >
> >> >> our current servers are 2 x 4 core Xeon E5320 CPUs
> >> at 2Ghz.
> >> >
> >> > Which doesn't have hyperthreading.
> >> >
> >
> > Yep, off. If you look at the benchmarks I took, HT absolutely killed it.
> >
> >> > PostgreSQL often performs worse with hyperthreading
> >> than without.
> >> > Have you turned HT off on your new machine?  If
> >> not, I would start
> >> > there.
> >>
> >> And then make sure you aren't running CFQ.
> >>
> >> JD
> >>
> >
> > Not running CFQ, running the no-op i/o scheduler.
>
> Just FYI, in synthetic pgbench type benchmarks, a 48 core AMD Magny
> Cours with LSI HW RAID and 34 15k6 Hard drives scales almost linearly
> up to 48 or so threads, getting into the 7000+ tps range.  With SW
> RAID it gets into the 5500 tps range.

Just wondering, which LSI card ?
Was this 32 drives in Raid 1+0 with a two drive raid 1 for logs or some
other config?


-M




Re: Linux: more cores = less concurrency.

From
Scott Marlowe
Date:
On Mon, Apr 11, 2011 at 6:05 PM, mark <dvlhntr@gmail.com> wrote:
> Just wondering, which LSI card ?
> Was this 32 drives in Raid 1+0 with a two drive raid 1 for logs or some
> other config?

We were using the LSI8888, but I'll be switching back to Areca when we
go back to HW RAID.  The LSI8888 only performed well if we set up 15
RAID-1 pairs in HW and used Linux SW RAID 0 on top.  RAID1+0 in the
LSI8888 was a pretty mediocre performer.  The Areca 1680, OTOH, beats it
in every test, with HW RAID10 only.  Much simpler to admin.

Re: Linux: more cores = less concurrency.

From
Scott Marlowe
Date:
On Mon, Apr 11, 2011 at 6:18 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Mon, Apr 11, 2011 at 6:05 PM, mark <dvlhntr@gmail.com> wrote:
>> Just wondering, which LSI card ?
>> Was this 32 drives in Raid 1+0 with a two drive raid 1 for logs or some
>> other config?
>
> We were using the LSI8888 but I'll be switching back to Areca when we
> go back to HW RAID.  The LSI8888 only performed well if we setup 15
> RAID-1 pairs in HW and use linux SW RAID 0 on top.  RAID1+0 in the
> LSI8888 was a pretty mediocre performer.  Areca 1680 OTOH, beats it in
> every test, with HW RAID10 only.  Much simpler to admin.

And it was RAID-10 with 4 drives for pg_xlog and RAID-10 with 24 drives
for the data store.  We tested both controllers, and pure SW RAID after
the LSI8888s cooked inside the poorly cooled Supermicro 1U we had them in.

Re: Linux: more cores = less concurrency.

From
"mark"
Date:

> -----Original Message-----
> From: Scott Marlowe [mailto:scott.marlowe@gmail.com]
> Sent: Monday, April 11, 2011 6:18 PM
> To: mark
> Cc: Glyn Astill; Kevin Grittner; Joshua D. Drake; pgsql-
> performance@postgresql.org
> Subject: Re: [PERFORM] Linux: more cores = less concurrency.
>
> On Mon, Apr 11, 2011 at 6:05 PM, mark <dvlhntr@gmail.com> wrote:
> > Just wondering, which LSI card ?
> > Was this 32 drives in Raid 1+0 with a two drive raid 1 for logs or
> some
> > other config?
>
> We were using the LSI8888 but I'll be switching back to Areca when we
> go back to HW RAID.  The LSI8888 only performed well if we setup 15
> RAID-1 pairs in HW and use linux SW RAID 0 on top.  RAID1+0 in the
> LSI8888 was a pretty mediocre performer.  Areca 1680 OTOH, beats it in
> every test, with HW RAID10 only.  Much simpler to admin.

Interesting, thanks for sharing.

I guess I have never gotten to the point where I felt I needed more than 2
drives for my xlogs. Maybe I have been too quick to dismiss that as a
possibility. (my biggest array is only 24 SFF drives tho)

I am trying to get my hands on a dual core LSI card for testing at work
(either a 9265-8i or 9285-8e). I don't see any dual core 6Gbps SAS Areca
cards yet.... still rocking an Areca 1130 at home tho.


-M


Re: Linux: more cores = less concurrency.

From
Merlin Moncure
Date:
On Mon, Apr 11, 2011 at 5:06 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Glyn Astill <glynastill@yahoo.co.uk> wrote:
>
>> The issue I'm seeing is that 8 real cores outperform 16 real
>> cores, which outperform 32 real cores under high concurrency.
>
> With every benchmark I've done of PostgreSQL, the "knee" in the
> performance graph comes right around ((2 * cores) +
> effective_spindle_count).  With the database fully cached (as I
> believe you mentioned), effective_spindle_count is zero.  If you
> don't use a connection pool to limit active transactions to the
> number from that formula, performance drops off.  The more CPUs you
> have, the sharper the drop after the knee.

I was about to say something similar with some canned advice to use a
connection pooler to control this.  However, the OP's scaling is more or
less topping out at cores / 4... yikes!  Here are my suspicions in
rough order:

1. There is scaling problem in client/network/etc.  Trivially
disproved, convert the test to pgbench -f and post results
2. The test is in fact i/o bound. Scaling is going to be
hardware/kernel determined.  Can we see iostat/vmstat/top snipped
during test run?  Maybe no-op is burning you?
3. Locking/concurrency issue in heavy_seat_function() (source for
that?)  how much writing does it do?

Can we see some iobound and cpubound pgbench runs on both servers?
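A minimal sketch for collecting the requested iostat/vmstat/top snapshots alongside a test run (tool availability and flags vary by distro):

```shell
#!/bin/sh
# Capture one-second samples of system activity for the duration of a test,
# so the i/o-bound vs cpu-bound question can be answered from the logs.
DURATION=60
vmstat 1 "$DURATION"       > vmstat.log &
iostat -x 1 "$DURATION"    > iostat.log &
top -b -d 1 -n "$DURATION" > top.log &
wait
echo "wrote vmstat.log iostat.log top.log"
```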

merlin

Re: Linux: more cores = less concurrency.

From
Scott Marlowe
Date:
On Mon, Apr 11, 2011 at 6:50 PM, mark <dvlhntr@gmail.com> wrote:
>
> Interesting, thanks for sharing.
>
> I guess I have never gotten to the point where I felt I needed more than 2
> drives for my xlogs. Maybe I have been dismissing that as a possibility
> something. (my biggest array is only 24 SFF drives tho)
>
> I am trying to get my hands on a dual core lsi card for testing at work.
> (either a 9265-8i or 9285-8e) don't see any dual core 6Gbps SAS Areca cards
> yet....still rocking a Arcea 1130 at home tho.

Make doubly sure whatever machine you're putting it in moves plenty of
air across its PCI cards.  They make plenty of heat.  The Areca 1880s
are the 6Gb/s cards; I don't know if they're single or dual core.  The
LSI interface and command line tools are so horribly designed, and the
performance was so substandard, that I've pretty much given up on them.
Maybe the newer cards are better, but the 9xxx series wouldn't get
along with my motherboard, so it was the 8888 or Areca.

As for pg_xlog, with 4 drives in a RAID-10 we were hitting a limit
with only two drives in RAID-1 against 24 drives in the RAID-10 for
the data store in our mixed load.  And we use an old 12xx series Areca
at work for our primary file server and it's been super reliable for
the two years it's been running.

Re: Linux: more cores = less concurrency.

From
Jesper Krogh
Date:
On 2011-04-11 22:39, James Cloos wrote:
>>>>>> "GA" == Glyn Astill<glynastill@yahoo.co.uk>  writes:
> GA>  I was hoping someone had seen this sort of behaviour before,
> GA>  and could offer some sort of explanation or advice.
>
> Jesper's reply is probably most on point as to the reason.
>
> I know that recent Opterons use some of their cache to better manage
> cache-coherency.  I presume recent Xeons do so, too, but perhaps yours
> are not recent enough for that?

Better cache-coherency also helps, but it does nothing about the fact
that remote DRAM fetches are way more expensive than local ones.
(Hard to get exact numbers nowadays.)

--
Jesper

Re: Linux: more cores = less concurrency.

From
Scott Marlowe
Date:
On Mon, Apr 11, 2011 at 7:04 AM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
> Hi Guys,
>
> I'm just doing some tests on a new server running one of our heavy select
> functions (the select part of a plpgsql function to allocate seats)
> concurrently.  We do use connection pooling and split out some selects to
> slony slaves, but the tests here are primarily to test what an individual
> server is capable of.
>
> The new server uses 4 x 8 core Xeon X7550 CPUs at 2Ghz, our current servers are 2 x 4 core Xeon E5320 CPUs at 2Ghz.
>
> What I'm seeing is when the number of clients is greater than the number of
> cores, the new servers perform better on fewer cores.

O man, I completely forgot the issue I ran into in my machines, and
that was that zone_reclaim completely screwed postgresql and file
system performance.  On machines with more CPU nodes and higher
internode cost it gets turned on automagically and destroys
performance for machines that use a lot of kernel cache / shared
memory.

Be sure and use sysctl.conf to turn it off:

vm.zone_reclaim_mode = 0
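A sketch of checking and persisting that setting (paths assume a standard sysctl setup):

```shell
#!/bin/sh
# Check zone_reclaim_mode (0 = off), turn it off now, and persist across reboots.
cat /proc/sys/vm/zone_reclaim_mode             # current value
sysctl -w vm.zone_reclaim_mode=0               # apply immediately (as root)
grep -q 'vm.zone_reclaim_mode' /etc/sysctl.conf || \
    echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf
```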

Re: Linux: more cores = less concurrency.

From
Arjen van der Meijden
Date:
On 11-4-2011 22:04 david@lang.hm wrote:
> in your case, try your new servers without hyperthreading. you will end
> up with a 4x4 core system, which should handily outperform the 2x4 core
> system you are replacing.
>
> the limit isn't 8 cores, it's that the hyperthreaded cores don't work
> well with the postgres access patterns.

It would be really weird if disabling HT turned these 8-core cpu's
into 4-core cpu's ;) They have 8 physical cores and 16 threads each, so
he basically has a 32-core machine with 64 threads in total (if HT were
enabled). Still, HT may or may not improve things; back when we had time
to benchmark new systems we had one of the first HT-Xeons (a dual 5080,
with two cores + HT each) available:
http://ic.tweakimg.net/ext/i/1155958729.png

The blue lines are all slightly above the orange/red lines. So back then
HT slightly improved our read-mostly Postgresql benchmark score.

We also did benchmarks with Sun's UltraSparc T2 back then:
http://ic.tweakimg.net/ext/i/1214930814.png

Adding full cores (including threads) made things much better, but we
also tested full cores with more threads each:
http://ic.tweakimg.net/ext/i/1214930816.png

As you can see, with that benchmark, it was better to have 4 cores with
8 threads each, than 8 cores with 2 threads each.

The T2 threads were much heavier duty than the HT threads back then,
but afaik Intel has improved its technology quite a bit with this
re-introduction of them.

So I wouldn't dismiss hyper threading for a read-mostly Postgresql
workload too easily.

Then again, keeping 32 cores busy, without them contending for every
resource will already be quite hard. So adding 32 additional "threads"
may indeed make matters much worse.

Best regards,

Arjen

Re: Linux: more cores = less concurrency.

From
Glyn Astill
Date:
--- On Tue, 12/4/11, Merlin Moncure <mmoncure@gmail.com> wrote:

> >> The issue I'm seeing is that 8 real cores
> outperform 16 real
> >> cores, which outperform 32 real cores under high
> concurrency.
> >
> > With every benchmark I've done of PostgreSQL, the
> "knee" in the
> > performance graph comes right around ((2 * cores) +
> > effective_spindle_count).  With the database fully
> cached (as I
> > believe you mentioned), effective_spindle_count is
> zero.  If you
> > don't use a connection pool to limit active
> transactions to the
> > number from that formula, performance drops off.  The
> more CPUs you
> > have, the sharper the drop after the knee.
>
> I was about to say something similar with some canned
> advice to use a
> connection pooler to control this.  However, OP
> scaling is more or
> less topping out at cores / 4...yikes!.  Here are my
> suspicions in
> rough order:
>
> 1. There is scaling problem in client/network/etc. 
> Trivially
> disproved, convert the test to pgbench -f and post results
> 2. The test is in fact i/o bound. Scaling is going to be
> hardware/kernel determined.  Can we see
> iostat/vmstat/top snipped
> during test run?  Maybe no-op is burning you?

This is during my 80 clients test; this is a point at which the performance
is well below that of the same machine limited to 8 cores.

http://www.privatepaste.com/dc131ff26e

> 3. Locking/concurrency issue in heavy_seat_function()
> (source for
> that?)  how much writing does it do?
>

No writing afaik - it's a select with a few joins and subqueries - I'm pretty
sure it's not writing out temp data either, but all clients are after the
same data in the test - maybe there's some locks there?

> Can we see some iobound and cpubound pgbench runs on both
> servers?
>

Of course, I'll post when I've gotten to that.


Re: Linux: more cores = less concurrency.

From
Glyn Astill
Date:
--- On Tue, 12/4/11, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> From: Scott Marlowe <scott.marlowe@gmail.com>
> Subject: Re: [PERFORM] Linux: more cores = less concurrency.
> To: "Glyn Astill" <glynastill@yahoo.co.uk>
> Cc: pgsql-performance@postgresql.org
> Date: Tuesday, 12 April, 2011, 6:55
> On Mon, Apr 11, 2011 at 7:04 AM, Glyn
> Astill <glynastill@yahoo.co.uk>
> wrote:
> > Hi Guys,
> >
> > I'm just doing some tests on a new server running one
> of our heavy select functions (the select part of a plpgsql
> function to allocate seats) concurrently.  We do use
> connection pooling and split out some selects to slony
> slaves, but the tests here are primeraly to test what an
> individual server is capable of.
> >
> > The new server uses 4 x 8 core Xeon X7550 CPUs at
> 2Ghz, our current servers are 2 x 4 core Xeon E5320 CPUs at
> 2Ghz.
> >
> > What I'm seeing is when the number of clients is
> greater than the number of cores, the new servers perform
> better on fewer cores.
>
> O man, I completely forgot the issue I ran into in my
> machines, and
> that was that zone_reclaim completely screwed postgresql
> and file
> system performance.  On machines with more CPU nodes
> and higher
> internode cost it gets turned on automagically and
> destroys
> performance for machines that use a lot of kernel cache /
> shared
> memory.
>
> Be sure and use sysctl.conf to turn it off:
>
> vm.zone_reclaim_mode = 0
>

I've made this change; I haven't seen any immediate difference, but it's good to know. Thanks Scott.
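For anyone else hitting this, a quick sketch of checking and persisting the setting on a typical Linux box (standard paths; needs root):

```shell
# Show the current mode; any non-zero value means zone reclaim is on.
cat /proc/sys/vm/zone_reclaim_mode

# Turn it off for the running kernel...
sysctl -w vm.zone_reclaim_mode=0

# ...and persist it across reboots.
echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf
sysctl -p
```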

Re: Linux: more cores = less concurrency.

From
Glyn Astill
Date:
--- On Mon, 11/4/11, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

> From: Kevin Grittner <Kevin.Grittner@wicourts.gov>
> Subject: Re: [PERFORM] Linux: more cores = less concurrency.
> To: david@lang.hm, "Steve Clark" <sclark@netwolves.com>, "Kevin Grittner" <Kevin.Grittner@wicourts.gov>, "Glyn
Astill"<glynastill@yahoo.co.uk> 
> Cc: "Joshua D. Drake" <jd@commandprompt.com>, "Scott Marlowe" <scott.marlowe@gmail.com>,
pgsql-performance@postgresql.org
> Date: Monday, 11 April, 2011, 22:35
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov>
> wrote:
>
> > I don't know why you were hitting the knee sooner than
> I've seen
> > in my benchmarks
>
> If you're compiling your own executable, you might try
> boosting
> LOG2_NUM_LOCK_PARTITIONS (defined in lwlocks.h) to 5 or
> 6.  The
> current value of 4 means that there are 16 partitions to
> spread
> contention for the lightweight locks which protect the
> heavyweight
> locking, and this corresponds to your best throughput
> point.  It
> might be instructive to see what happens when you tweak the
> number
> of partitions.
>

Tried tweaking LOG2_NUM_LOCK_PARTITIONS between 5 and 7. My results took a dive when I changed to 32 partitions, and
improved as I increased to 128, but appeared to be happiest at the default of 16.
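For anyone wanting to repeat that experiment, the rebuild loop looks roughly like this (a sketch; it assumes a source build, and the source-tree path is an assumption — in the PostgreSQL sources the constant lives in src/include/storage/lwlock.h):

```shell
# Hypothetical loop: patch LOG2_NUM_LOCK_PARTITIONS, rebuild, reinstall.
cd /usr/src/postgresql   # assumed location of the source tree
for log2 in 4 5 6 7; do
    sed -i "s/#define LOG2_NUM_LOCK_PARTITIONS.*/#define LOG2_NUM_LOCK_PARTITIONS  $log2/" \
        src/include/storage/lwlock.h
    make -s && make -s install
    # restart the cluster and rerun the benchmark here
done
```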

> Also, if you can profile PostgreSQL at the sweet spot and
> again at a
> pessimal load, comparing the profiles should give good
> clues about
> the points of contention.
>

Results for the same machine on 8 and 32 cores are here:

http://www.8kb.co.uk/server_benchmarks/dblt_results.csv

Here's the sweet spot for 32 cores, and the 8 core equivalent:

http://www.8kb.co.uk/server_benchmarks/iostat-32cores_32Clients.txt
http://www.8kb.co.uk/server_benchmarks/vmstat-32cores_32Clients.txt

http://www.8kb.co.uk/server_benchmarks/iostat-8cores_32Clients.txt
http://www.8kb.co.uk/server_benchmarks/vmstat-8cores_32Clients.txt

... and at the pessimal load for 32 cores, and the 8 core equivalent:

http://www.8kb.co.uk/server_benchmarks/iostat-32cores_100Clients.txt
http://www.8kb.co.uk/server_benchmarks/vmstat-32cores_100Clients.txt

http://www.8kb.co.uk/server_benchmarks/iostat-8cores_100Clients.txt
http://www.8kb.co.uk/server_benchmarks/vmstat-8cores_100Clients.txt

vmstat shows double the context switches on 32 cores; could this be a factor? Is there anything else I'm missing there?

Cheers
Glyn

Re: Linux: more cores = less concurrency.

From
Merlin Moncure
Date:
On Tue, Apr 12, 2011 at 3:54 AM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
> --- On Tue, 12/4/11, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> >> The issue I'm seeing is that 8 real cores
>> outperform 16 real
>> >> cores, which outperform 32 real cores under high
>> concurrency.
>> >
>> > With every benchmark I've done of PostgreSQL, the
>> "knee" in the
>> > performance graph comes right around ((2 * cores) +
>> > effective_spindle_count).  With the database fully
>> cached (as I
>> > believe you mentioned), effective_spindle_count is
>> zero.  If you
>> > don't use a connection pool to limit active
>> transactions to the
>> > number from that formula, performance drops off.  The
>> more CPUs you
>> > have, the sharper the drop after the knee.
>>
>> I was about to say something similar with some canned
>> advice to use a
>> connection pooler to control this.  However, OP
>> scaling is more or
>> less topping out at cores / 4...yikes!.  Here are my
>> suspicions in
>> rough order:
>>
>> 1. There is scaling problem in client/network/etc.
>> Trivially
>> disproved, convert the test to pgbench -f and post results
>> 2. The test is in fact i/o bound. Scaling is going to be
>> hardware/kernel determined.  Can we see
>> iostat/vmstat/top snipped
>> during test run?  Maybe no-op is burning you?
>
> This is during my 80 clients test, this is a point at which the performance is well below that of the same machine
limited to 8 cores.
>
> http://www.privatepaste.com/dc131ff26e
>
>> 3. Locking/concurrency issue in heavy_seat_function()
>> (source for
>> that?)  how much writing does it do?
>>
>
> No writing afaik - its a select with a few joins and subqueries - I'm pretty sure it's not writing out temp data
either, but all clients are after the same data in the test - maybe theres some locks there?
>
>> Can we see some iobound and cpubound pgbench runs on both
>> servers?
>>
>
> Of course, I'll post when I've gotten to that.

Ok, there's no writing going on -- so the i/o tests aren't necessary.
Context switches are also not too high -- the problem is likely in
postgres or on your end.

However, I would still like to see:
pgbench select only tests:
pgbench -i -s 1
pgbench -S -c 8 -t 500
pgbench -S -c 32 -t 500
pgbench -S -c 80 -t 500

pgbench -i -s 500
pgbench -S -c 8 -t 500
pgbench -S -c 32 -t 500
pgbench -S -c 80 -t 500

write out bench.sql with:
begin;
select * from heavy_seat_function();
select * from heavy_seat_function();
commit;

pgbench -n -f bench.sql -c 8 -t 500
pgbench -n -f bench.sql -c 8 -t 500
pgbench -n -f bench.sql -c 8 -t 500

I'm still suspecting an obvious problem here.  One thing we may have
overlooked is that you are connecting and disconnecting once per
benchmarking step (two query executions).  If you have heavy RSA
encryption enabled on connection establishment, this could eat you.

If pgbench results confirm your scaling problems and the issue is not
in the general area of connection establishment, it's time to break
out the profiler :/.

merlin

Re: Linux: more cores = less concurrency.

From
Merlin Moncure
Date:
On Tue, Apr 12, 2011 at 8:23 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Apr 12, 2011 at 3:54 AM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
>> --- On Tue, 12/4/11, Merlin Moncure <mmoncure@gmail.com> wrote:
>>
>>> >> The issue I'm seeing is that 8 real cores
>>> outperform 16 real
>>> >> cores, which outperform 32 real cores under high
>>> concurrency.
>>> >
>>> > With every benchmark I've done of PostgreSQL, the
>>> "knee" in the
>>> > performance graph comes right around ((2 * cores) +
>>> > effective_spindle_count).  With the database fully
>>> cached (as I
>>> > believe you mentioned), effective_spindle_count is
>>> zero.  If you
>>> > don't use a connection pool to limit active
>>> transactions to the
>>> > number from that formula, performance drops off.  The
>>> more CPUs you
>>> > have, the sharper the drop after the knee.
>>>
>>> I was about to say something similar with some canned
>>> advice to use a
>>> connection pooler to control this.  However, OP
>>> scaling is more or
>>> less topping out at cores / 4...yikes!.  Here are my
>>> suspicions in
>>> rough order:
>>>
>>> 1. There is scaling problem in client/network/etc.
>>> Trivially
>>> disproved, convert the test to pgbench -f and post results
>>> 2. The test is in fact i/o bound. Scaling is going to be
>>> hardware/kernel determined.  Can we see
>>> iostat/vmstat/top snipped
>>> during test run?  Maybe no-op is burning you?
>>
>> This is during my 80 clients test, this is a point at which the performance is well below that of the same machine
limited to 8 cores.
>>
>> http://www.privatepaste.com/dc131ff26e
>>
>>> 3. Locking/concurrency issue in heavy_seat_function()
>>> (source for
>>> that?)  how much writing does it do?
>>>
>>
>> No writing afaik - its a select with a few joins and subqueries - I'm pretty sure it's not writing out temp data
either, but all clients are after the same data in the test - maybe theres some locks there?
>>
>>> Can we see some iobound and cpubound pgbench runs on both
>>> servers?
>>>
>>
>> Of course, I'll post when I've gotten to that.
>
> Ok, there's no writing going on -- so the i/o tests aren't necessary.
> Context switches are also not too high -- the problem is likely in
> postgres or on your end.
>
> However, I would still like to see:
> pgbench select only tests:
> pgbench -i -s 1
> pgbench -S -c 8 -t 500
> pgbench -S -c 32 -t 500
> pgbench -S -c 80 -t 500
>
> pgbench -i -s 500
> pgbench -S -c 8 -t 500
> pgbench -S -c 32 -t 500
> pgbench -S -c 80 -t 500
>
> write out bench.sql with:
> begin;
> select * from heavy_seat_function();
> select * from heavy_seat_function();
> commit;
>
> pgbench -n bench.sql -c 8 -t 500
> pgbench -n bench.sql -c 8 -t 500
> pgbench -n bench.sql -c 8 -t 500

whoops:
pgbench -n -f bench.sql -c 8 -t 500
pgbench -n -f bench.sql -c 32 -t 500
pgbench -n -f bench.sql -c 80 -t 500
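Putting the corrected sequence together, a sketch of a driver script (the database name and connection details are assumptions; pgbench takes custom scripts via -f):

```shell
#!/bin/sh
# Hypothetical wrapper around the pgbench runs listed above.
DB=bench

# Select-only runs at two scales.
for scale in 1 500; do
    pgbench -i -s "$scale" "$DB"
    for clients in 8 32 80; do
        pgbench -S -c "$clients" -t 500 "$DB"
    done
done

# Custom-script run against the seat-allocation function.
cat > bench.sql <<'EOF'
begin;
select * from heavy_seat_function();
select * from heavy_seat_function();
commit;
EOF

for clients in 8 32 80; do
    pgbench -n -f bench.sql -c "$clients" -t 500 "$DB"
done
```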

merlin

Re: Linux: more cores = less concurrency.

From
"Kevin Grittner"
Date:
Glyn Astill <glynastill@yahoo.co.uk> wrote:

> Tried tweaking LOG2_NUM_LOCK_PARTITIONS between 5 and 7. My
> results took a dive when I changed to 32 partitions, and improved
> as I increased to 128, but appeared to be happiest at the default
> of 16.

Good to know.

>> Also, if you can profile PostgreSQL at the sweet spot and again
>> at a pessimal load, comparing the profiles should give good clues
>> about the points of contention.

> [iostat and vmstat output]

Wow, zero idle and zero wait, and single digit for system.  Did you
ever run those RAM speed tests?  (I don't remember seeing results
for that -- or failed to recognize them.)  My best guess at this
point is that you don't have the bandwidth to RAM to support the
CPU power.  Databases tend to push data around in RAM a lot.

When I mentioned profiling, I was thinking more of oprofile or
something like it.  If it were me, I'd be going there by now.

-Kevin

Re: Linux: more cores = less concurrency.

From
Glyn Astill
Date:
--- On Tue, 12/4/11, Merlin Moncure <mmoncure@gmail.com> wrote:

> >>> Can we see some iobound and cpubound pgbench
> runs on both
> >>> servers?
> >>>
> >>
> >> Of course, I'll post when I've gotten to that.
> >
> > Ok, there's no writing going on -- so the i/o tests
> aren't necessary.
> > Context switches are also not too high -- the problem
> is likely in
> > postgres or on your end.
> >
> > However, I would still like to see:
> > pgbench select only tests:
> > pgbench -i -s 1
> > pgbench -S -c 8 -t 500
> > pgbench -S -c 32 -t 500
> > pgbench -S -c 80 -t 500
> >
> > pgbench -i -s 500
> > pgbench -S -c 8 -t 500
> > pgbench -S -c 32 -t 500
> > pgbench -S -c 80 -t 500
> >
> > write out bench.sql with:
> > begin;
> > select * from heavy_seat_function();
> > select * from heavy_seat_function();
> > commit;
> >
> > pgbench -n bench.sql -c 8 -t 500
> > pgbench -n bench.sql -c 8 -t 500
> > pgbench -n bench.sql -c 8 -t 500
>
> whoops:
> pgbench -n bench.sql -c 8 -t 500
> pgbench -n bench.sql -c 32 -t 500
> pgbench -n bench.sql -c 80 -t 500
>
> merlin
>

Right, here they are:

http://www.privatepaste.com/3dd777f4db



Re: Linux: more cores = less concurrency.

From
Glyn Astill
Date:
--- On Tue, 12/4/11, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

> Wow, zero idle and zero wait, and single digit for
> system.  Did you
> ever run those RAM speed tests?  (I don't remember
> seeing results
> for that -- or failed to recognize them.)  My best
> guess at this point is that you don't have the bandwidth to
> RAM to
> support the CPU power.  Databases tend to push data
> around in RAM a
> lot.

I mentioned sysbench was giving me something like 3000 MB/sec on memory write tests, but nothing more.

Results from Greg Smith's stream_scaling test are here:

http://www.privatepaste.com/4338aa1196

>
> When I mentioned profiling, I was thinking more of oprofile
> or
> something like it.  If it were me, I'd be going there
> by now.
>

Advice taken, it'll be my next step.
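For reference, a minimal oprofile session for this kind of comparison might look like the following sketch (legacy opcontrol interface; the postgres binary path is an assumption):

```shell
# Profile at the sweet spot, then again at the pessimal load,
# and compare the two reports.
opcontrol --init
opcontrol --no-vmlinux   # use --vmlinux=/path/to/vmlinux for kernel symbols
opcontrol --start

# ... run the benchmark at the load of interest ...

opcontrol --stop
opcontrol --dump
opreport -l /usr/local/pgsql/bin/postgres > profile_80clients.txt
opcontrol --shutdown
```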

Glyn

Re: Linux: more cores = less concurrency.

From
Merlin Moncure
Date:
On Tue, Apr 12, 2011 at 11:01 AM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
> --- On Tue, 12/4/11, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> >>> Can we see some iobound and cpubound pgbench
>> runs on both
>> >>> servers?
>> >>>
>> >>
>> >> Of course, I'll post when I've gotten to that.
>> >
>> > Ok, there's no writing going on -- so the i/o tests
>> aren't necessary.
>> > Context switches are also not too high -- the problem
>> is likely in
>> > postgres or on your end.
>> >
>> > However, I would still like to see:
>> > pgbench select only tests:
>> > pgbench -i -s 1
>> > pgbench -S -c 8 -t 500
>> > pgbench -S -c 32 -t 500
>> > pgbench -S -c 80 -t 500
>> >
>> > pgbench -i -s 500
>> > pgbench -S -c 8 -t 500
>> > pgbench -S -c 32 -t 500
>> > pgbench -S -c 80 -t 500
>> >
>> > write out bench.sql with:
>> > begin;
>> > select * from heavy_seat_function();
>> > select * from heavy_seat_function();
>> > commit;
>> >
>> > pgbench -n bench.sql -c 8 -t 500
>> > pgbench -n bench.sql -c 8 -t 500
>> > pgbench -n bench.sql -c 8 -t 500
>>
>> whoops:
>> pgbench -n bench.sql -c 8 -t 500
>> pgbench -n bench.sql -c 32 -t 500
>> pgbench -n bench.sql -c 80 -t 500
>>
>> merlin
>>
>
> Right, here they are:
>
> http://www.privatepaste.com/3dd777f4db

Your results unfortunately confirmed the worst -- no easy answers on
this one :(.  Before breaking out the profiler, can you take some
random samples of:

select count(*) from pg_stat_activity where waiting;

to see if you have any locking issues?
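One way to take those samples while the benchmark runs (a sketch; on servers of this era the relevant column is the boolean `waiting`, and connection details are assumed):

```shell
# Sample the count of backends blocked on locks, once a second.
for i in $(seq 1 30); do
    psql -At -c "select now(), count(*) from pg_stat_activity where waiting;"
    sleep 1
done
```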
Also, are you sure your function executions are relatively free of
side effects?
I can take a look at the code off list if you'd prefer to keep it discreet.

merlin

Re: Linux: more cores = less concurrency.

From
"Kevin Grittner"
Date:
Glyn Astill <glynastill@yahoo.co.uk> wrote:

> Results from Greg Smiths stream_scaling test are here:
>
> http://www.privatepaste.com/4338aa1196

Well, that pretty much clinches it.  Your RAM access tops out at 16
processors.  It appears that your processors are spending most of
their time waiting for and contending for the RAM bus.

I have gotten machines in where moving a jumper, flipping a DIP
switch, or changing BIOS options from the default made a big
difference.  I'd be looking at the manuals for my motherboard and
BIOS right now to see what options there might be to improve that.

-Kevin

Re: Linux: more cores = less concurrency.

From
Claudio Freire
Date:
On Tue, Apr 12, 2011 at 6:40 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
>
> Well, that pretty much clinches it.  Your RAM access tops out at 16
> processors.  It appears that your processors are spending most of
> their time waiting for and contending for the RAM bus.

It tops, but it doesn't drop.

I'd propose that the perceived drop in TPS is due to cache contention
- i.e., more processes fighting for the scarce cache means less
efficient use of the bandwidth (which is constant upwards of 16 processes).

So... the solution would be to add more servers, rather than just sockets.
(or a server with more sockets *and* more bandwidth)

Re: Linux: more cores = less concurrency.

From
"F. BROUARD / SQLpro"
Date:
Hi,

I think that a NUMA architecture machine can solve the problem....

Cheers,
On 11/04/2011 15:04, Glyn Astill wrote:
>
> Hi Guys,
>
> I'm just doing some tests on a new server running one of our heavy select functions (the select part of a plpgsql
function to allocate seats) concurrently.  We do use connection pooling and split out some selects to slony slaves, but
the tests here are primarily to test what an individual server is capable of.
>
> The new server uses 4 x 8 core Xeon X7550 CPUs at 2Ghz, our current servers are 2 x 4 core Xeon E5320 CPUs at 2Ghz.
>
> What I'm seeing is when the number of clients is greater than the number of cores, the new servers perform better on
fewer cores.
>
> Has anyone else seen this behaviour?  I'm guessing this is either a hardware limitation or something to do with linux
process management / scheduling? Any idea what to look into?
>
> My benchmark utility is just using a little .net/npgsql app that runs increasing numbers of clients concurrently,
each client runs a specified number of iterations of any sql I specify.
>
> I've posted some results and the test program here:
>
> http://www.8kb.co.uk/server_benchmarks/
>
>


--
Frédéric BROUARD - expert SGBDR et SQL - MVP SQL Server - 06 11 86 40 66
Le site sur le langage SQL et les SGBDR  :  http://sqlpro.developpez.com
Enseignant Arts & Métiers PACA, ISEN Toulon et CESI/EXIA Aix en Provence
Audit, conseil, expertise, formation, modélisation, tuning, optimisation
*********************** http://www.sqlspot.com *************************


Re: Linux: more cores = less concurrency.

From
Greg Smith
Date:
Kevin Grittner wrote:
> Glyn Astill <glynastill@yahoo.co.uk> wrote:
>
>
>> Results from Greg Smiths stream_scaling test are here:
>>
>> http://www.privatepaste.com/4338aa1196
>>
>
> Well, that pretty much clinches it.  Your RAM access tops out at 16
> processors.  It appears that your processors are spending most of
> their time waiting for and contending for the RAM bus.
>

I've pulled Glyn's results into
https://github.com/gregs1104/stream-scaling so they're easy to compare
against similar processors; his system is the one labeled 4 X X7550.  I'm
hearing this same story from multiple people lately:  these 32+ core
servers bottleneck on aggregate memory speed when running PostgreSQL,
long before the CPUs are fully utilized.  This server is close to
maximum memory utilization at 8 cores, and the small increase in gross
throughput above that doesn't seem to be making up for the loss from L1
and L2 thrashing from trying to run more.  These systems with many cores
can only be used fully if you have a program that can work efficiently
some of the time with just local CPU resources.  That's very rarely the
case for a database that's moving 8K pages, tuple caches, and other
forms of working memory around all the time.


> I have gotten machines in where moving a jumper, flipping a DIP
> switch, or changing BIOS options from the default made a big
> difference.  I'd be looking at the manuals for my motherboard and
> BIOS right now to see what options there might be to improve that

I already forwarded Glyn a good article about tuning these Dell BIOSs in
particular from an interesting blog series others here might like too:

http://bleything.net/articles/postgresql-benchmarking-memory.html

Ben Bleything is doing a very thorough walk-through of server hardware
validation, and as is often the case he's already found one major
problem with the vendor config he had to fix to get expected results.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Linux: more cores = less concurrency.

From
Greg Smith
Date:
Scott Marlowe wrote:
> Have you tried running the memory stream benchmark Greg Smith had
> posted here a while back?  It'll let you know if your memory is
> bottlenecking.  Right now my 48 core machines are the king of that
> benchmark with something like 70+Gig a second.
>

The big Opterons are still the front-runners here, but not with 70GB/s
anymore.  Earlier versions of stream-scaling didn't use nearly enough
data to avoid L3 cache in the processors interfering with results.  More
recent tests I've gotten in done after I expanded the default test size
for them show the Opterons normally hitting the same ~35GB/s maximum
throughput that the Intel processors get out of similar DDR3/1333 sets.
There are some outliers where >50GB/s still shows up.  I'm not sure if I
really believe them though; attempts to increase the test size now hit a
32-bit limit inside stream.c, and I think that's not really big enough
to avoid L3 cache effects here.

In the table at https://github.com/gregs1104/stream-scaling the 4 X 6172
server is similar to Scott's system.  I believe the results for 8
(37613) and 48 cores (32301) there.  I remain somewhat suspicious that
the higher results of 40 - 51GB/s shown between 16 and 32 cores may be
inflated by caching.  At this point I'll probably need direct access to
one of them to resolve this for sure.  I've made a lot of progress with
other people's servers, but complete trust in those particular results
still isn't there yet.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Linux: more cores = less concurrency.

From
Merlin Moncure
Date:
On Tue, Apr 12, 2011 at 12:00 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Kevin Grittner wrote:
>>
>> Glyn Astill <glynastill@yahoo.co.uk> wrote:
>>
>>>
>>> Results from Greg Smiths stream_scaling test are here:
>>>
>>> http://www.privatepaste.com/4338aa1196
>>>
>>
>>  Well, that pretty much clinches it.  Your RAM access tops out at 16
>> processors.  It appears that your processors are spending most of
>> their time waiting for and contending for the RAM bus.
>>
>
> I've pulled Glyn's results into https://github.com/gregs1104/stream-scaling
> so they're easy to compare against similar processors, his system is the one
> labeled 4 X X7550.  I'm hearing this same story from multiple people lately:
>  these 32+ core servers bottleneck on aggregate memory speed with running
> PostgreSQL long before the CPUs are fully utilized.  This server is close to
> maximum memory utilization at 8 cores, and the small increase in gross
> throughput above that doesn't seem to be making up for the loss in L1 and L2
> thrashing from trying to run more.  These systems with many cores can only
> be used fully if you have a program that can work efficiently some of the
> time with just local CPU resources.  That's very rarely the case for a
> database that's moving 8K pages, tuple caches, and other forms of working
> memory around all the time.
>
>
>> I have gotten machines in where moving a jumper, flipping a DIP
>> switch, or changing BIOS options from the default made a big
>> difference.  I'd be looking at the manuals for my motherboard and
>> BIOS right now to see what options there might be to improve that
>
> I already forwarded Glyn a good article about tuning these Dell BIOSs in
> particular from an interesting blog series others here might like too:
>
> http://bleything.net/articles/postgresql-benchmarking-memory.html
>
> Ben Bleything is doing a very thorough walk-through of server hardware
> validation, and as is often the case he's already found one major problem
> with the vendor config he had to fix to get expected results.

For posterity, since it looks like you guys have nailed this one, I
took a look at some of the code off list and I can confirm there is no
obvious bottleneck coming from locking type issues.  The functions are
'stable' as implemented with no fancy tricks.

merlin

Re: Linux: more cores = less concurrency.

From
"Strange, John W"
Date:
When purchasing the intel 7500 series, please make sure to check the hemisphere mode of your memory configuration.
There is a HUGE difference in memory speed (around 50%) if you don't populate all the memory slots on
the controllers properly.
 

https://globalsp.ts.fujitsu.com/dmsp/docs/wp-nehalem-ex-memory-performance-ww-en.pdf

- John

-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Merlin
Moncure
Sent: Tuesday, April 12, 2011 12:14 PM
To: Greg Smith
Cc: Kevin Grittner; david@lang.hm; Steve Clark; Glyn Astill; Joshua D. Drake; Scott Marlowe;
pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Linux: more cores = less concurrency.

On Tue, Apr 12, 2011 at 12:00 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Kevin Grittner wrote:
>>
>> Glyn Astill <glynastill@yahoo.co.uk> wrote:
>>
>>>
>>> Results from Greg Smiths stream_scaling test are here:
>>>
>>> http://www.privatepaste.com/4338aa1196
>>>
>>
>>  Well, that pretty much clinches it.  Your RAM access tops out at 16 
>> processors.  It appears that your processors are spending most of 
>> their time waiting for and contending for the RAM bus.
>>
>
> I've pulled Glyn's results into 
> https://github.com/gregs1104/stream-scaling
> so they're easy to compare against similar processors, his system is 
> the one labeled 4 X X7550.  I'm hearing this same story from multiple people lately:
>  these 32+ core servers bottleneck on aggregate memory speed with 
> running PostgreSQL long before the CPUs are fully utilized.  This 
> server is close to maximum memory utilization at 8 cores, and the 
> small increase in gross throughput above that doesn't seem to be 
> making up for the loss in L1 and L2 thrashing from trying to run more.  
> These systems with many cores can only be used fully if you have a 
> program that can work efficiently some of the time with just local CPU
> resources.  That's very rarely the case for a database that's moving 
> 8K pages, tuple caches, and other forms of working memory around all the time.
>
>
>> I have gotten machines in where moving a jumper, flipping a DIP 
>> switch, or changing BIOS options from the default made a big 
>> difference.  I'd be looking at the manuals for my motherboard and 
>> BIOS right now to see what options there might be to improve that
>
> I already forwarded Glyn a good article about tuning these Dell BIOSs 
> in particular from an interesting blog series others here might like too:
>
> http://bleything.net/articles/postgresql-benchmarking-memory.html
>
> Ben Bleything is doing a very thorough walk-through of server hardware 
> validation, and as is often the case he's already found one major 
> problem with the vendor config he had to fix to get expected results.

For posterity, since it looks like you guys have nailed this one, I took a look at some of the code off list and I can
confirm there is no obvious bottleneck coming from locking type issues.  The functions are 'stable' as implemented with
no fancy tricks.
 


merlin

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Linux: more cores = less concurrency.

From
Glyn Astill
Date:
--- On Tue, 12/4/11, Greg Smith <greg@2ndquadrant.com> wrote:

> From: Greg Smith <greg@2ndquadrant.com>
> Subject: Re: [PERFORM] Linux: more cores = less concurrency.
> To: "Kevin Grittner" <Kevin.Grittner@wicourts.gov>
> Cc: david@lang.hm, "Steve Clark" <sclark@netwolves.com>, "Glyn Astill" <glynastill@yahoo.co.uk>, "Joshua D. Drake"
<jd@commandprompt.com>,"Scott Marlowe" <scott.marlowe@gmail.com>, pgsql-performance@postgresql.org 
> Date: Tuesday, 12 April, 2011, 18:00
> Kevin Grittner wrote:
> > Glyn Astill <glynastill@yahoo.co.uk>
> wrote:
> >   
> >> Results from Greg Smiths stream_scaling test are
> here:
> >>
> >> http://www.privatepaste.com/4338aa1196
> >>     
> >  Well, that pretty much clinches it.  Your
> RAM access tops out at 16
> > processors.  It appears that your processors are
> spending most of
> > their time waiting for and contending for the RAM
> bus.
> >   
>
> I've pulled Glyn's results into https://github.com/gregs1104/stream-scaling so they're
> easy to compare against similar processors, his system is
> the one labeled 4 X X7550.  I'm hearing this same story
> from multiple people lately:  these 32+ core servers
> bottleneck on aggregate memory speed with running PostgreSQL
> long before the CPUs are fully utilized.  This server
> is close to maximum memory utilization at 8 cores, and the
> small increase in gross throughput above that doesn't seem
> to be making up for the loss in L1 and L2 thrashing from
> trying to run more.  These systems with many cores can
> only be used fully if you have a program that can work
> efficiently some of the time with just local CPU
> resources.  That's very rarely the case for a database
> that's moving 8K pages, tuple caches, and other forms of
> working memory around all the time.
>
>
> > I have gotten machines in where moving a jumper,
> flipping a DIP
> > switch, or changing BIOS options from the default made
> a big
> > difference.  I'd be looking at the manuals for my
> motherboard and
> > BIOS right now to see what options there might be to
> improve that
>
> I already forwarded Glyn a good article about tuning these
> Dell BIOSs in particular from an interesting blog series
> others here might like too:
>
> http://bleything.net/articles/postgresql-benchmarking-memory.html
>
> Ben Bleything is doing a very thorough walk-through of
> server hardware validation, and as is often the case he's
> already found one major problem with the vendor config he
> had to fix to get expected results.
>

Thanks Greg.  I've been through that post, but unfortunately there are no settings that make a difference.

However upon further investigation and looking at the manual for the R910 here

http://support.dell.com/support/edocs/systems/per910/en/HOM/HTML/install.htm#wp1266264

I've discovered we only have 4 of the 8 memory risers, and the manual states that in this configuration we are running
in "Power Optimized" mode, rather than "Performance Optimized".

We've got two of these machines, so I've just pulled all the risers from one system, removed half the memory as
indicated by that document from Dell above, and now I'm seeing almost double the throughput.



Re: Linux: more cores = less concurrency.

From
Scott Carey
Date:
If postgres is memory bandwidth constrained, what can be done to reduce
its bandwidth use?

Huge Pages could help some, by reducing page table lookups and making
overall access more efficient.
Compressed pages (snappy / lzo) in memory can help trade CPU cycles for
memory usage for certain memory segments/pages -- this could potentially
save a lot of I/O too if more pages fit in RAM as a result, and also make
caches more effective.

As I've noted before, the optimizer inappropriately chooses the larger side
of a join to hash instead of the smaller one in many cases on hash joins,
which is less cache efficient.
Dual-pivot quicksort is more cache friendly than Postgres' single-pivot
one and uses less memory bandwidth on average (fewer swaps, but the same
number of compares).
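As an illustration of the hash-side choice described above, here is a toy hash join that always builds its table on the smaller input. The names and structure are purely illustrative, not the planner's internals:

```python
# Sketch of a hash join that builds on the smaller relation, so the
# hash table is as small (and as cache-resident) as possible.
def hash_join(left, right, key=lambda row: row[0]):
    """Join two lists of tuples on key(row); build side = smaller input."""
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    swapped = build is right  # remember orientation so output stays (left, right)

    table = {}
    for row in build:
        table.setdefault(key(row), []).append(row)

    out = []
    for row in probe:
        for match in table.get(key(row), []):
            out.append((match, row) if not swapped else (row, match))
    return out
```

Building on the smaller side means fewer entries competing for L1/L2 lines during the probe phase, which is the cache-efficiency argument being made.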



On 4/13/11 2:48 AM, "Glyn Astill" <glynastill@yahoo.co.uk> wrote:

>--- On Tue, 12/4/11, Greg Smith <greg@2ndquadrant.com> wrote:
>
>>
>>
>
>Thanks Greg.  I've been through that post, but unfortunately there are no
>settings that make a difference.
>
>However upon further investigation and looking at the manual for the R910
>here
>
>http://support.dell.com/support/edocs/systems/per910/en/HOM/HTML/install.htm#wp1266264
>
>I've discovered we only have 4 of the 8 memory risers, and the manual
>states that in this configuration we are running in "Power Optimized"
>mode, rather than "Performance Optimized".
>
>We've got two of these machines, so I've just pulled all the risers from
>one system, removed half the memory as indicated by that document from
>Dell above, and now I'm seeing almost double the throughput.
>
>
>
>--
>Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>To make changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-performance


Re: Linux: more cores = less concurrency.

From
Greg Smith
Date:
Scott Carey wrote:
> If postgres is memory bandwidth constrained, what can be done to reduce
> its bandwidth use?
>
> Huge Pages could help some, by reducing page table lookups and making
> overall access more efficient.
> Compressed pages (snappy / lzo) in memory can help trade CPU cycles for
> memory usage for certain memory segments/pages -- this could potentially
> save a lot of I/O too if more pages fit in RAM as a result, and also make
> caches more effective.
>

The problem with a lot of these ideas is that they trade the memory
problem for increased disruption to the CPU L1 and L2 caches.  I don't
know how much that moves the bottleneck forward.  And not every workload
is memory constrained, either, so those that aren't might suffer from
the same optimizations that help in this situation.

I just posted my slides from my MySQL conference talk today at
http://projects.2ndquadrant.com/talks , and those include some graphs of
recent data collected with stream-scaling.  The current situation is
really strange in both Intel and AMD's memory architectures.  I'm even
seeing situations where lightly loaded big servers are actually
outperformed by small ones running the same workload.  The 32 and 48
core systems using server-class DDR3/1333 just don't have the bandwidth
to a single core that, say, an i7 desktop using triple-channel DDR3-1600
does.  The trade-offs here are extremely hardware and workload
dependent, and it's very easy to tune for one combination while slowing
another.
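To illustrate the kind of measurement stream-scaling automates, here is a toy single-threaded sketch. It is Python, so the absolute numbers are not meaningful (real STREAM is compiled C run across many threads); it only shows the idea of timing bulk memory traffic:

```python
# Toy version of the STREAM "copy" test: time one pass over a large
# buffer and report apparent memory bandwidth.
import time

def copy_bandwidth_mb_s(size_mb=256):
    """Time copying a size_mb buffer; returns apparent MB/s moved."""
    src = bytearray(size_mb * 1024 * 1024)
    start = time.perf_counter()
    dst = bytes(src)                          # one read + one write pass
    elapsed = max(time.perf_counter() - start, 1e-9)
    assert len(dst) == len(src)
    return (2 * size_mb) / elapsed            # read + write traffic, MB/s
```

Running this pinned to different cores (e.g. with taskset) is roughly what exposes the per-core vs. aggregate bandwidth gap discussed above.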

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Linux: more cores = less concurrency.

From
Florian Weimer
Date:
* Jesper Krogh:

> If you have a 1 socket system, all of your data can be fetched from
> "local" ram seen from your CPU, on a 2 socket, 50% of your accesses
> will be "way slower", 4 socket even worse.

There are non-NUMA multi-socket systems, so this doesn't apply in all
cases.  (The E5320-based system is likely non-NUMA.)

Speaking about NUMA, do you know if there are some non-invasive tools
which can be used to monitor page migration and off-node memory
accesses?

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Re: Linux: more cores = less concurrency.

From
Cédric Villemain
Date:
2011/4/14 Florian Weimer <fweimer@bfk.de>:
> * Jesper Krogh:
>
>> If you have a 1 socket system, all of your data can be fetched from
>> "local" ram seen from your CPU, on a 2 socket, 50% of your accesses
>> will be "way slower", 4 socket even worse.
>
> There are non-NUMA multi-socket systems, so this doesn't apply in all
> cases.  (The E5320-based system is likely non-NUMA.)
>
> Speaking about NUMA, do you know if there are some non-invasive tools
> which can be used to monitor page migration and off-node memory
> accesses?

I am unsure it is exactly what you are looking for, but Linux does
provide access to counters in:
/sys/devices/system/node/node*/numastat

I also find it useful to check meminfo per node instead of via /proc
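To make that concrete, a small sketch that parses those counters and reports how often allocations missed the preferred node. It assumes the standard sysfs layout and the usual "name value" per-line file format; the function names are illustrative:

```python
# Sketch: parse /sys/devices/system/node/node*/numastat counters.
import glob

def read_numastat(text):
    """Parse the 'name value' lines of a numastat file into a dict."""
    return {name: int(val) for name, val in
            (line.split() for line in text.strip().splitlines())}

def numa_miss_ratio(stats):
    """Fraction of allocations that missed the preferred node."""
    total = stats["numa_hit"] + stats["numa_miss"]
    return stats["numa_miss"] / float(total) if total else 0.0

# On a NUMA box one would feed it the real files:
# for path in glob.glob("/sys/devices/system/node/node*/numastat"):
#     print(path, numa_miss_ratio(read_numastat(open(path).read())))
```

Note these are allocation-time counters, not per-access ones, so they answer the page-placement part of Florian's question but not the off-node access-rate part.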


--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: Linux: more cores = less concurrency.

From
Scott Carey
Date:

On 4/13/11 9:23 PM, "Greg Smith" <greg@2ndquadrant.com> wrote:

>Scott Carey wrote:
>> If postgres is memory bandwidth constrained, what can be done to reduce
>> its bandwidth use?
>>
>> Huge Pages could help some, by reducing page table lookups and making
>> overall access more efficient.
>> Compressed pages (snappy / lzo) in memory can help trade CPU cycles for
>> memory usage for certain memory segments/pages -- this could potentially
>> save a lot of I/O too if more pages fit in RAM as a result, and also
>>make
>> caches more effective.
>>
>
>The problem with a lot of these ideas is that they trade the memory
>problem for increased disruption to the CPU L1 and L2 caches.  I don't
>know how much that moves the bottleneck forward.  And not every workload
>is memory constrained, either, so those that aren't might suffer from
>the same optimizations that help in this situation.

Compression has this problem, but I'm not sure where the plural "a lot of
these ideas" comes from.

Huge Pages helps caches.
Dual-Pivot quicksort is more cache friendly and is _always_ equal to or
faster than traditional quicksort (it's a provably improved algorithm).
Smaller hash tables help caches.

>
>I just posted my slides from my MySQL conference talk today at
>http://projects.2ndquadrant.com/talks , and those include some graphs of
>recent data collected with stream-scaling.  The current situation is
>really strange in both Intel and AMD's memory architectures.  I'm even
>seeing situations where lightly loaded big servers are actually
>outperformed by small ones running the same workload.  The 32 and 48
>core systems using server-class DDR3/1333 just don't have the bandwidth
>to a single core that, say, an i7 desktop using triple-channel DDR3-1600
>does.  The trade-offs here are extremely hardware and workload
>dependent, and it's very easy to tune for one combination while slowing
>another.
>
>--
>Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
>PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
>"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
>


Re: Linux: more cores = less concurrency.

From
Claudio Freire
Date:
On Thu, Apr 14, 2011 at 10:05 PM, Scott Carey <scott@richrelevance.com> wrote:
> Huge Pages helps caches.
> Dual-Pivot quicksort is more cache friendly and is _always_ equal to or
> faster than traditional quicksort (it's a provably improved algorithm).

If you want a cache-friendly sorting algorithm, you need mergesort.

I don't know any algorithm as friendly to caches as mergesort.

Quicksort could be better only when the sorting buffer is guaranteed
to fit in the CPU's cache, and that's usually just a few 4KB pages.
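The cache-friendliness claim comes from mergesort's purely sequential access pattern, which hardware prefetchers handle well. A minimal top-down sketch, for illustration only:

```python
def mergesort(a):
    """Top-down mergesort: all reads and writes are sequential streams."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = mergesort(a[:mid]), mergesort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:          # <= keeps the sort stable
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out
```

The trade-off, raised later in the thread, is that mergesort needs an auxiliary buffer, so it touches roughly twice the memory of an in-place quicksort.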

Re: Linux: more cores = less concurrency.

From
Scott Carey
Date:
On 4/14/11 1:19 PM, "Claudio Freire" <klaussfreire@gmail.com> wrote:

>On Thu, Apr 14, 2011 at 10:05 PM, Scott Carey <scott@richrelevance.com>
>wrote:
>> Huge Pages helps caches.
>> Dual-Pivot quicksort is more cache friendly and is _always_ equal to or
>> faster than traditional quicksort (it's a provably improved algorithm).
>
>If you want a cache-friendly sorting algorithm, you need mergesort.
>
>I don't know any algorithm as friendly to caches as mergesort.
>
>Quicksort could be better only when the sorting buffer is guaranteed
>to fit on the CPU's cache, and that's usually just a few 4kb pages.

Of the mergesort variants, Timsort is a recent general-purpose one favored
by many since it is sub-O(n log n) on partially sorted data.

Which works best under which circumstances depends a lot on the size of the
data, size of the elements, cost of the compare function, whether you're
sorting the data directly or sorting pointers, and other factors.

Mergesort may be more cache friendly (?) but might use more memory
bandwidth.  I'm not sure.

I do know that dual-pivot quicksort provably causes fewer swaps (but the
same # of compares) than the usual single-pivot quicksort.  And swaps are a
lot slower than you would expect due to the effects on processor caches.
Therefore it might help with multiprocessor scalability by reducing
memory/cache pressure.
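For reference, a minimal in-place sketch of the dual-pivot (Yaroslavskiy-style) partitioning being discussed. This is an illustration of the algorithm, not Postgres code:

```python
def dual_pivot_quicksort(a, lo=0, hi=None):
    """In-place dual-pivot quicksort: partition into <p, [p..q], >q regions."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return a
    if a[lo] > a[hi]:
        a[lo], a[hi] = a[hi], a[lo]
    p, q = a[lo], a[hi]              # the two pivots, p <= q
    lt, gt, i = lo + 1, hi - 1, lo + 1
    while i <= gt:
        if a[i] < p:                 # goes to the left region
            a[i], a[lt] = a[lt], a[i]
            lt += 1
            i += 1
        elif a[i] > q:               # goes to the right region; re-examine swap
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:                        # stays in the middle region
            i += 1
    lt -= 1
    gt += 1
    a[lo], a[lt] = a[lt], a[lo]      # move pivots into final positions
    a[hi], a[gt] = a[gt], a[hi]
    dual_pivot_quicksort(a, lo, lt - 1)
    dual_pivot_quicksort(a, lt + 1, gt - 1)
    dual_pivot_quicksort(a, gt + 1, hi)
    return a
```

Each pass splits the input three ways instead of two, which is where the reduction in total swaps (and hence cache traffic) comes from; this is the variant Java 7 adopted for Arrays.sort on primitives.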


Re: Linux: more cores = less concurrency.

From
Claudio Freire
Date:
On Fri, Apr 15, 2011 at 12:42 AM, Scott Carey <scott@richrelevance.com> wrote:
> I do know that dual-pivot quicksort provably causes fewer swaps (but the
> same # of compares) than the usual single-pivot quicksort.  And swaps are a
> lot slower than you would expect due to the effects on processor caches.
> Therefore it might help with multiprocessor scalability by reducing
> memory/cache pressure.

I agree, and it's quite non-disruptive - i.e., a drop-in replacement for
quicksort, whereas mergesort or timsort both require bigger changes
and heavier profiling.