Thread: SMP scaling

SMP scaling

From: Mark Rae
On Tue, Mar 15, 2005 at 07:00:25PM -0500, Bruce Momjian wrote:
> Oh, you have to try CVS HEAD or a nightly snapshot.  Tom made a major
> change that allows scaling in SMP environments.

Ok, I've done the tests comparing 8.0.1 against a snapshot from the 16th
and the results are impressive.
As well as the 16-CPU Altix, I've done comparisons on two other
4-CPU machines which previously didn't scale as well as expected.


Clients                1      2      3      4      6      8     12     16     32     64
---------------------------------------------------------------------------------------
Altix pg-8.0.1      1.00   2.02   2.98   3.97   5.87   7.23   7.51   5.54   4.68   5.10
Altix pg-20050316   1.00   1.97   2.86   3.68   5.29   6.90   9.00   9.88  10.03   9.94
AMD64 pg-8.0.1      1.00   1.87   2.77   3.34   2.73   2.57   2.58   2.62
AMD64 pg-20050316   1.00   1.95   2.84   3.69   3.61   3.66   3.70   3.69
IA64  pg-8.0.1      1.00   1.97   2.91   3.82   2.91   2.92   2.94   2.98
IA64  pg-20050316   1.00   1.98   2.95   3.87   3.80   3.78   3.86   3.90

Altix == 16x 1.6GHz Itanium2    192GB memory
AMD64 ==  4x 2.2GHz Opteron 848   8GB memory
IA64  ==  4x 1.5GHz Itanium2     16GB memory


The Altix still only scales up to 10x rather than 16x, but that is
probably the NUMA configuration taking effect now.
Also, this machine isn't set up to run databases, so it has only
1 FC I/O card, which means a CPU can end up 4 hops away from the
memory and disk.

As the database is so small (8GB) relative to the machine, the data
will be on average 2 hops away. That works out to an average of 72%
of the speed of local memory, based on previous measurements of
speed vs. hops.

So getting 63% of the theoretical maximum database throughput
(10.03x out of a possible 16x) is pretty good.


    -Mark

Re: SMP scaling

From: Tom Lane
Mark Rae <mrae@purplebat.com> writes:
> Ok, I've done the tests comparing 8.0.1 against a snapshot from the 16th
> and the results are impressive.

> Clients                1      2      3      4      6      8     12     16     32     64
> ---------------------------------------------------------------------------------------
> Altix pg-8.0.1      1.00   2.02   2.98   3.97   5.87   7.23   7.51   5.54   4.68   5.10
> Altix pg-20050316   1.00   1.97   2.86   3.68   5.29   6.90   9.00   9.88  10.03   9.94
> AMD64 pg-8.0.1      1.00   1.87   2.77   3.34   2.73   2.57   2.58   2.62
> AMD64 pg-20050316   1.00   1.95   2.84   3.69   3.61   3.66   3.70   3.69
> IA64  pg-8.0.1      1.00   1.97   2.91   3.82   2.91   2.92   2.94   2.98
> IA64  pg-20050316   1.00   1.98   2.95   3.87   3.80   3.78   3.86   3.90


Hey, that looks pretty sweet.  One thing this obscures though is whether
there is any change in the single-client throughput rate --- ie, is "1.00"
better or worse for CVS tip vs 8.0.1?

            regards, tom lane

Re: SMP scaling

From: Mark Rae
On Fri, Mar 18, 2005 at 10:38:24AM -0500, Tom Lane wrote:
> Hey, that looks pretty sweet.  One thing this obscures though is whether
> there is any change in the single-client throughput rate --- ie, is "1.00"
> better or worse for CVS tip vs 8.0.1?

Here are the figures in queries per second.

Clients                1      2      3      4      6      8     12     16     32     64
---------------------------------------------------------------------------------------
AMD64 pg-8.0.1      6.80  12.71  18.82  22.73  18.58  17.48  17.56  17.81
AMD64 pg-20050316   6.80  13.23  19.32  25.09  24.56  24.93  25.20  25.09
IA64  pg-8.0.1      3.72   7.32  10.81  14.21  10.81  10.85  10.92  11.09
IA64  pg-20050316   3.99   7.92  11.78  15.46  15.17  15.09  15.41  15.58
Altix pg-8.0.1      3.66   7.37  10.89  14.53  21.47  26.47  27.47  20.28  17.12  18.66
Altix pg-20050316   3.83   7.55  10.98  14.10  20.27  26.47  34.50  37.88  38.45  38.12

So it didn't make any difference for the Opteron, but single-client
throughput on the two Itanium machines improved by about 7% (IA64)
and 5% (Altix).

    -Mark

Re: SMP scaling

From: Bruce Momjian
Mark Rae wrote:
> On Fri, Mar 18, 2005 at 10:38:24AM -0500, Tom Lane wrote:
> > Hey, that looks pretty sweet.  One thing this obscures though is whether
> > there is any change in the single-client throughput rate --- ie, is "1.00"
> > better or worse for CVS tip vs 8.0.1?
>
> Here are the figures in queries per second.
>
> Clients                1      2      3      4      6      8     12     16     32     64
> ---------------------------------------------------------------------------------------
> AMD64 pg-8.0.1      6.80  12.71  18.82  22.73  18.58  17.48  17.56  17.81
> AMD64 pg-20050316   6.80  13.23  19.32  25.09  24.56  24.93  25.20  25.09
> IA64  pg-8.0.1      3.72   7.32  10.81  14.21  10.81  10.85  10.92  11.09
> IA64  pg-20050316   3.99   7.92  11.78  15.46  15.17  15.09  15.41  15.58
> Altix pg-8.0.1      3.66   7.37  10.89  14.53  21.47  26.47  27.47  20.28  17.12  18.66
> Altix pg-20050316   3.83   7.55  10.98  14.10  20.27  26.47  34.50  37.88  38.45  38.12
>
> So it didn't make any difference for the Opteron, but single-client
> throughput on the two Itanium machines improved by about 7% (IA64)
> and 5% (Altix).

So it seems our entire SMP problem was that global lock.  Nice.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: SMP scaling

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> So it seems our entire SMP problem was that global lock.  Nice.

Yeah, I was kind of expecting to see the LockMgrLock up next, but
it seems we're still a ways away from having a problem there.  I guess
that's because we only tend to touch locks once per query, whereas
we're grabbing and releasing buffers much more.

From the relatively small absolute value of Mark's queries/sec numbers,
I suppose he is testing some fairly heavyweight queries (big enough
to not emphasize per-query overhead).  I wonder what the numbers would
look like with very small, simple queries.  It'd move the stress around
for sure ...

            regards, tom lane

Re: SMP scaling

From: Tom Lane
Mark Rae <mrae@purplebat.com> writes:
> The Altix still only scales up to 10x rather than 16x, but that is
> probably the NUMA configuration taking effect now.

BTW, although I know next to nothing about NUMA, I do know that it is
configurable to some extent (eg, via numactl).  What was the
configuration here exactly, and did you try alternatives?  Also,
what was the OS exactly?  (I've heard that RHEL4 is a whole lot better
than RHEL3 in managing NUMA, for example.  This may be generic to 2.6 vs
2.4 Linux kernels, or maybe Red Hat did some extra hacking.)

            regards, tom lane

Re: SMP scaling

From: Mark Rae
On Fri, Mar 18, 2005 at 01:31:51PM -0500, Tom Lane wrote:
> BTW, although I know next to nothing about NUMA, I do know that it is
> configurable to some extent (eg, via numactl).  What was the
> configuration here exactly, and did you try alternatives?  Also,
> what was the OS exactly?  (I've heard that RHEL4 is a whole lot better
> than RHEL3 in managing NUMA, for example.  This may be generic to 2.6 vs
> 2.4 Linux kernels, or maybe Red Hat did some extra hacking.)

The Altix uses a 2.4.21 kernel with SGI's own modifications to
support up to 256 CPUs and their NUMAlink hardware (some of which
became the NUMA code in the 2.6 kernel).

Even with the NUMA support, which makes sure that memory allocated
by malloc or on the stack ends up local to the processor that first
touched it (and then keeps the process scheduled on that CPU), there
is still the problem that all table accesses* go through the shared
buffer cache, which resides in one location.
[* is this true in all cases?]
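
As a rough illustration of that split, here is a libnuma sketch.
(Note this is the 2.6-era numactl/libnuma interface, not SGI's 2.4
patches, so treat it as illustrative only.)

#include <stdio.h>
#include <numa.h>               /* link with -lnuma */

int
main(void)
{
    void       *local_buf;
    void       *shared_buf;

    if (numa_available() < 0)
    {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    /* Pin this process to node 0; its allocations stay local. */
    numa_run_on_node(0);
    local_buf = numa_alloc_local(1024 * 1024);

    /*
     * But a shared region is still allocated once, on one node;
     * processes running on other nodes pay the interconnect
     * latency on every access.
     */
    shared_buf = numa_alloc_onnode(64 * 1024 * 1024, 0);

    /* ... use the memory ... */

    numa_free(local_buf, 1024 * 1024);
    numa_free(shared_buf, 64 * 1024 * 1024);
    return 0;
}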

I was about to write a long explanation of how the only way to
scale out to this size would be to have separate buffer caches in
each memory domain, which would then require some kind of
cache-coherency mechanism. But after reading a few bits of
documentation, it looks like SGI already has a solution in the
form of symmetric data objects.

In particular the symmetric heap: an area of shared memory
which is replicated across all memory domains, with
coherency handled in hardware.

So it looks like all that might be needed is to replace the
shmget calls in src/backend/port with the equivalent SGI functions.
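
Very roughly, something like the sketch below. The first half
mirrors what src/backend/port does today with System V shared
memory; the second half assumes SGI's shmalloc() symmetric-heap
allocator, which is guesswork on my part (including whether its
start_pes() startup model fits the postmaster at all).

#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Today (simplified): one SysV segment, resident on a single node. */
void *
create_sysv_segment(key_t key, size_t size)
{
    int         shmid = shmget(key, size, IPC_CREAT | IPC_EXCL | 0600);

    if (shmid < 0)
        return NULL;
    /* caller must also check for shmat's (void *) -1 failure value */
    return shmat(shmid, NULL, 0);
}

#ifdef USE_SGI_SYMMETRIC_HEAP
#include <mpp/shmem.h>

/*
 * Hypothetical Altix variant: allocate from the symmetric heap,
 * which the hardware replicates and keeps coherent across nodes.
 */
void *
create_symmetric_segment(size_t size)
{
    return shmalloc(size);
}
#endif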

    -Mark

Re: SMP scaling

From: Tom Lane
Mark Rae <mrae@purplebat.com> writes:
> Even with the NUMA support, which makes sure that memory allocated
> by malloc or on the stack ends up local to the processor that first
> touched it (and then keeps the process scheduled on that CPU), there
> is still the problem that all table accesses* go through the shared
> buffer cache, which resides in one location.
> [* is this true in all cases?]

Temp tables are handled in backend-local memory, but all normal tables
have to be accessed through shared buffers.  The implications of not
doing that are bad enough that it's hard to believe it could be a win
to change.  (In short: the hardware may not like syncing across
processors, but it can still do it faster than we could hope to do in
software.)

> it looks like SGI already has a solution in the
> form of symmetric data objects.
> In particular the symmetric heap: an area of shared memory
> which is replicated across all memory domains, with
> coherency handled in hardware.

Hmm, do they support spinlock objects in this memory?  If so it could be
just the right thing.
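
(For context, our spinlocks boil down to an atomic test-and-set on
a flag living in the shared segment, as in the sketch below; that
read-modify-write is exactly what would have to stay coherent
across the replicated domains. C11 atomics are used here purely
for illustration; the real TAS() macros live in
src/include/storage/s_lock.h.)

#include <stdatomic.h>

typedef struct
{
    atomic_flag locked;
} spinlock;

static void
spin_lock(spinlock *lock)
{
    /* atomic test-and-set; loops until the flag was previously clear */
    while (atomic_flag_test_and_set_explicit(&lock->locked,
                                             memory_order_acquire))
        ;                       /* busy-wait */
}

static void
spin_unlock(spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}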

> So it looks like all that might be needed is to replace the
> shmget calls in src/backend/port with the equivalent SGI functions.

Send a patch ;-)

            regards, tom lane