Thread: Spinlocks and CPU Architectures

Spinlocks and CPU Architectures

From

Simon Riggs

Date:

11 October 2005, 09:17:08

The long history of spinlock issues has recently been attacked
significantly by Tom, but I wanted to get a status on this issue before
we release 8.1.

My understanding of the problems of spinlocking has been greatly
enhanced by two recent articles:

Linux Journal, discussing Linux SMP portability issues across multiple
CPU architectures
http://www.linuxjournal.com/article/8211
http://www.linuxjournal.com/article/8212

A similar blog from an MS guy, who has been looking into porting the MS
CLR across to multiple CPUs, rather than just x86 arch & derivatives.
This article was the nearest I could find to an article I found in the
most recent copy of the MS Developer Journal, that discussed multi-
processor synchronisation techniques.
http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx

The conclusion I draw from all of this is that the spinlock code needs
to be specialised for individual hardware architectures, not just
instruction sets. That is, we need different code for Intel and AMD, as
well as for other CPU types - *if* you want it to perform well for SMP.

From the experience with Xeon and Xeon MP, it also seems to be the case
that different CPU variants and implementations show different memory
access behaviours in SMP systems compared with single/dual CPUs.

That's bad news for portability and for producing binaries, but we
should not use that as a reason to avoid facing up to the situation. The
rest of the world does seem to try quite hard to limit portability and
the articles above show that others must also be experiencing the same
difficulties.

Do other people reach the same conclusions?

Can we make a list of those architectures for which 8.1 is known to
perform reasonably well, with reasonable SMP scalability? I suggest that
we record this list somewhere in the release notes, but with a comment
to say we run on other architectures, but they have not been SMP tested
as of date of announcement. That is important, since the release notes
make specific claim about scalability features.

Best Regards, Simon Riggs

Re: Spinlocks and CPU Architectures

From

Emil Briggs

Date:

11 October 2005, 10:00:16

>
> Do other people reach the same conclusions?
>
> Can we make a list of those architectures for which 8.1 is known to
> perform reasonably well, with reasonable SMP scalability? I suggest that
> we record this list somewhere in the release notes, but with a comment
> to say we run on other architectures, but they have not been SMP tested
> as of date of announcement. That is important, since the release notes
> make specific claim about scalability features.
>

I posted a message last week about some tests with Tom's recent spinlock 
patches on a quad opteron server running Suse 9.2. I found that the patches 
helped a great deal when the concurrency level was less than or equal to the 
number of processors. When it was greater than that they didn't help nearly 
as much and in fact at high concurrency levels the application would run 
about as fast running on a single processor as on a quad. It was better than 
without the patches but that's not what I would call good scalability on this 
architecture.

Emil

P.S. I did put it into production last week since the gain when the 
concurrency level was <= 4 was so pronounced and it appears to be working 
well.

Re: Spinlocks and CPU Architectures

From

Tom Lane

Date:

11 October 2005, 12:12:52

Simon Riggs <simon@2ndquadrant.com> writes:
> The long history of spinlock issues has recently been attacked
> significantly by Tom, but I wanted to get a status on this issue before
> we release 8.1

I'd still like to do something more with that before we release, but
exactly what is TBD.

> The conclusion I draw from all of this is that the spinlock code needs
> to be specialised for individual hardware architectures, not just
> instruction sets. That is, we need different code for Intel and AMD, as
> well as for other CPU types - *if* you want it to perform well for SMP.

This seems pretty unworkable from a packaging standpoint.  Even if you
teach autoconf how to tell which model it's running on, there's no
guarantee that the resulting executables will be used on that same
machine.  We would have to make a run-time test, and I do not think that
that idea is attractive either --- adding a conditional branch to the
spinlock code will likely negate whatever performance improvement we
could hope to get.

As far as the x86_64 TAS code is concerned, my inclination is to remove
the cmpb test; that's a significant win on Opteron and only a small loss
on EM64T.  There is no evidence to justify removing cmpb on x86
architecture, but we can easily split the x86 and x86_64 cases.
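
For readers who have not seen the earlier thread, the two TAS styles being compared can be sketched roughly as below. This is a minimal illustration, not the actual PostgreSQL assembly: it uses modern GCC `__atomic` builtins instead of inline asm, and the names `slock_t`, `tas_with_precheck`, `tas_plain` and `s_unlock` are chosen here for illustration only. The pre-check variant reads the lock byte first (the role the cmpb plays) and skips the locked swap when the lock already looks busy; the plain variant always performs the swap.

```c
#include <stdbool.h>

/* Hypothetical lock byte for this sketch: 0 = free, 1 = held. */
typedef volatile unsigned char slock_t;

/*
 * Variant 1: look at the lock first, swap only if it appears free.
 * This mirrors the idea of the "cmpb" pre-check discussed above.
 * Returns true if the lock was already held, false if we acquired it.
 */
static inline bool tas_with_precheck(slock_t *lock)
{
    if (__atomic_load_n(lock, __ATOMIC_RELAXED) != 0)
        return true;    /* looks busy: avoid the bus-locking swap */
    return __atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) != 0;
}

/*
 * Variant 2: always do the atomic swap, with no pre-check.
 * Same return convention as above.
 */
static inline bool tas_plain(slock_t *lock)
{
    return __atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) != 0;
}

/* Release the lock so that another process can acquire it. */
static inline void s_unlock(slock_t *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```

Which variant is faster depends on how the CPU handles a plain read of a contended cache line versus a locked exchange, which is why the measurements differ between chips.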

I'd also like to apply some form of the adaptive-spin-delay patch that
was discussed last month.
        regards, tom lane

Re: Spinlocks and CPU Architectures

From

Martijn van Oosterhout

Date:

11 October 2005, 12:37:04

On Tue, Oct 11, 2005 at 11:12:46AM -0400, Tom Lane wrote:
> This seems pretty unworkable from a packaging standpoint.  Even if you
> teach autoconf how to tell which model it's running on, there's no
> guarantee that the resulting executables will be used on that same
> machine.  We would have to make a run-time test, and I do not think that
> that idea is attractive either --- adding a conditional branch to the
> spinlock code will likely negate whatever performance improvement we
> could hope to get.

Well, as long as the code you've got works on all the systems you
expect, you have the choice. If you start getting to the point where
there is no single piece of code that works across all the expected
systems, then you have an issue.

I don't think we're there yet, but I don't think using a function
pointer would be all that expensive?
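
A rough sketch of what the function-pointer idea could look like, so the cost being debated is concrete. Everything here is hypothetical: `tas_fn`, `choose_tas`, `try_lock` and the `prefer_precheck` flag are names invented for this example, and the flag stands in for whatever CPU detection would actually be done at startup. The point is only that the choice is made once, and every later lock attempt pays one indirect call instead of a conditional branch.

```c
#include <stdbool.h>

typedef volatile unsigned char slock_t;      /* 0 = free, 1 = held */
typedef bool (*tas_fn)(slock_t *);           /* returns true if already held */

/* Always swap. */
static bool tas_plain(slock_t *lock)
{
    return __atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) != 0;
}

/* Read first, swap only if the lock looks free. */
static bool tas_with_precheck(slock_t *lock)
{
    if (__atomic_load_n(lock, __ATOMIC_RELAXED) != 0)
        return true;
    return __atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) != 0;
}

/* Selected once at startup; defaults to the plain swap. */
static tas_fn tas_impl = tas_plain;

/*
 * Hypothetical startup hook. 'prefer_precheck' is an assumption of this
 * sketch; a real version would derive it from some run-time CPU test.
 */
static void choose_tas(bool prefer_precheck)
{
    tas_impl = prefer_precheck ? tas_with_precheck : tas_plain;
}

/* One indirect call per attempt; returns true if we got the lock. */
static bool try_lock(slock_t *lock)
{
    return !tas_impl(lock);
}
```

Whether the indirect call is cheaper than the gain from a tuned TAS is exactly the thing that would need measuring.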

Performance measuring I guess...
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Spinlocks and CPU Architectures

From

Peter Eisentraut

Date:

11 October 2005, 13:46:40

Tom Lane wrote:
> This seems pretty unworkable from a packaging standpoint.  Even if
> you teach autoconf how to tell which model it's running on, there's
> no guarantee that the resulting executables will be used on that same
> machine.

A number of packages in the video area (and perhaps others) do compile 
"sub-architecture" specific variants.  This could be done for 
PostgreSQL, but you'd probably need to show some pretty convincing 
performance numbers before people start the packaging effort.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Spinlocks and CPU Architectures

From

Simon Riggs

Date:

11 October 2005, 14:55:58

On Tue, 2005-10-11 at 18:45 +0200, Peter Eisentraut wrote:
> Tom Lane wrote:
> > This seems pretty unworkable from a packaging standpoint.  Even if
> > you teach autoconf how to tell which model it's running on, there's
> > no guarantee that the resulting executables will be used on that same
> > machine.
> 
> A number of packages in the video area (and perhaps others) do compile 
> "sub-architecture" specific variants.  This could be done for 
> PostgreSQL, but you'd probably need to show some pretty convincing 
> performance numbers before people start the packaging effort.

I completely agree, just note that we already have some cases where
convincing performance numbers exist. 

Tom is suggesting having different behaviour for x86 and x86_64. The x86
will still run on x86_64 architecture would it not? So we'll have two
binaries for each OS, yes?

In general, where we do find a clear difference, we should at very least
identify/record which variant the binary is most suitable for. At best
we could produce different executables, but I understand the packaging
effort required to do that.

Best Regards, Simon Riggs

Re: Spinlocks and CPU Architectures

From

"Dann Corbit"

Date:

11 October 2005, 15:09:36

As an aside, here is a package that has recently been BSD re-licensed:
http://sourceforge.net/projects/libltx/

It is a lightweight memory transaction package.  It comes with a paper
entitled "Cache Sensitive Software Transactional Memory" by Robert
Ennals.

In the paper, Robert Ennals suggests this form of concurrent programming
as a replacement for lock based programming.  A quote:
"We have now reached the point where transactions are outperforming
locks -- and people are starting to get interested."

There are a number of interesting claims in the paper.  Since the
license is now compatible, it may have some interest for integration
into the PostgreSQL core where appropriate.

It would certainly be worthwhile to read the paper and fool around with
the supplied test driver to compare the approaches.

If nobody on the PostgreSQL team has time for the experimentations, it
might be a good project for a PhD candidate at some university.


Re: Spinlocks and CPU Architectures

From

Tom Lane

Date:

11 October 2005, 15:28:11

Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2005-10-11 at 18:45 +0200, Peter Eisentraut wrote:
>> A number of packages in the video area (and perhaps others) do compile 
>> "sub-architecture" specific variants.  This could be done for 
>> PostgreSQL, but you'd probably need to show some pretty convincing 
>> performance numbers before people start the packaging effort.

> I completely agree, just note that we already have some cases where
> convincing performance numbers exist. 

I'm not sure the data we have is convincing enough to justify
sub-architecture packaging.  We have a somewhat-artificial test case
that's designed to generate the maximum possible spinlock contention,
and we can measure a significant difference in that context --- but
that's a long way from saying that a similar percentage difference would
be seen in real-world applications.  It's also a long way from saying
that the effort needed to set this up wouldn't be better repaid working
on other problems.

One thought that comes to mind is that these decisions are probably
comparable to those made by gcc conditional on -march flags.  Do we
get access to the -march setting by means of predefined symbols?
If so we could compile different TAS code for opteron and em64t without
introducing any packaging issues that didn't exist already.  It would
essentially be up to the source-code builder which sub-arch to tune for.

(Since the assembly code in question currently works only for gcc
anyway, I'm not too concerned about making gcc-specific assumptions
about the availability of flag macros.)
        regards, tom lane

Re: Spinlocks and CPU Architectures

From

Martijn van Oosterhout

Date:

11 October 2005, 16:10:06

On Tue, Oct 11, 2005 at 02:28:02PM -0400, Tom Lane wrote:
> One thought that comes to mind is that these decisions are probably
> comparable to those made by gcc conditional on -march flags.  Do we
> get access to the -march setting by means of predefined symbols?
> If so we could compile different TAS code for opteron and em64t without
> introducing any packaging issues that didn't exist already.  It would
> essentially be up to the source-code builder which sub-arch to tune for.

The option is available; see below. It appears __tune_xxx__ is set for
the -mcpu option and __xxx__ is set for the -march option. This is gcc
3.3.5, but it probably works for older versions...

-march controls which instructions to use, -mcpu is which scheduling to
assume. Which do we use? The former implies the latter but not the
other way round.

Have a nice day,

kleptog@vali:/tmp$ rm a.h
kleptog@vali:/tmp$ touch a.h
kleptog@vali:/tmp$ gcc -mcpu=i386 -E -dM a.h >/tmp/a
kleptog@vali:/tmp$ gcc -mcpu=pentium -E -dM a.h >/tmp/b
kleptog@vali:/tmp$ diff -u0 /tmp/a /tmp/b
--- /tmp/a      2005-10-11 21:02:38.000000000 +0200
+++ /tmp/b      2005-10-11 21:02:41.000000000 +0200
@@ -28,0 +29 @@
+#define __tune_pentium__ 1
@@ -44,0 +46 @@
+#define __tune_i586__ 1
@@ -61 +62,0 @@
-#define __tune_i386__ 1
kleptog@vali:/tmp$ gcc -march=i386 -E -dM a.h >/tmp/a
kleptog@vali:/tmp$ gcc -march=athlon -E -dM a.h >/tmp/b
kleptog@vali:/tmp$ diff -u0 /tmp/a /tmp/b
--- /tmp/a      2005-10-11 21:03:57.000000000 +0200
+++ /tmp/b      2005-10-11 21:05:47.000000000 +0200
@@ -4,0 +5 @@
+#define __3dNOW_A__ 1
@@ -16,0 +18 @@
+#define __athlon 1
@@ -41,0 +44 @@
+#define __MMX__ 1
@@ -46,0 +50 @@
+#define __athlon__ 1
@@ -59 +62,0 @@
-#define __tune_i386__ 1
@@ -63,0 +67,2 @@
+#define __tune_athlon__ 1
+#define __3dNOW__ 1
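
Given the macros in the output above, compile-time selection could be sketched as follows. Only macro names that appear in that output are tested (`__athlon__`, `__tune_athlon__`, `__tune_pentium__`, `__tune_i586__`); `SPIN_TUNING` and `spin_tuning_name` are hypothetical names for this example, and other compilers or gcc versions may define different symbols, so the final `#else` branch is the one most builds would hit.

```c
/*
 * Pick a tuning label at compile time from gcc's predefined macros.
 * The macro names tested are those shown in the gcc 3.3.5 output above;
 * anything else falls through to the generic case.
 */
#if defined(__athlon__) || defined(__tune_athlon__)
#define SPIN_TUNING "athlon"
#elif defined(__tune_pentium__) || defined(__tune_i586__)
#define SPIN_TUNING "pentium"
#else
#define SPIN_TUNING "generic"
#endif

/* Hypothetical helper: report which variant this build was tuned for. */
static const char *spin_tuning_name(void)
{
    return SPIN_TUNING;
}
```

In a real header the `#if` branches would select different TAS code rather than a string, and the choice would be fixed by whoever builds the source.
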
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/

Re: Spinlocks and CPU Architectures

From

Peter Eisentraut

Date:

11 October 2005, 16:21:21

Simon Riggs wrote:
> Tom is suggesting having different behaviour for x86 and x86_64. The
> x86 will still run on x86_64 architecture would it not? So we'll have
> two binaries for each OS, yes?

A quick glance around tells me that most free operating systems are 
treating x86 and x86_64 as separate platforms anyway, so having 
different code for each is not a problem.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Spinlocks and CPU Architectures

From

Tom Lane

Date:

11 October 2005, 16:59:13

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Tue, Oct 11, 2005 at 02:28:02PM -0400, Tom Lane wrote:
>> One thought that comes to mind is that these decisions are probably
>> comparable to those made by gcc conditional on -march flags.  Do we
>> get access to the -march setting by means of predefined symbols?

> The option is available; see below. It appears __tune_xxx__ is set for
> the -mcpu option and __xxx__ is set for the -march option. This is gcc
> 3.3.5, but it probably works for older versions...

Actually, after reviewing the thread from last month, I was
misremembering: *all* of the test cases we had for x86_64 showed the
cmpb to be a loss.  It was only on plain x86 that there was some
difference of results for that patch.  So I think we should just remove
the cmpb unconditionally for x86_64, and be done with it.  We can leave
the x86 case alone, at least for now, because there didn't seem to be
any cases of big wins there.

The reason I was confused was that the related patch to vary the spinlock
delay loop count seemed to be a win on Opteron but a loss on EM64T.
This probably means that we need a smarter algorithm for varying the
loop count --- the upper limit has to be configurable, perhaps.
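
One possible shape for such an algorithm, purely as a sketch and not the patch under discussion: keep a per-process spin budget, grow it when spinning alone was enough to get the lock, shrink it when we had to sleep, and clamp it to a configurable ceiling. The names `spin_tuner`, `spin_tuner_init`, `spin_tuner_update`, the floor of 10 and the step sizes are all assumptions made for this example.

```c
#define MIN_SPINS_PER_DELAY 10   /* assumed floor for this sketch */

typedef struct
{
    int spins_per_delay;         /* current spin budget before sleeping */
    int max_spins_per_delay;     /* configurable upper limit */
} spin_tuner;

/* Start at the floor; 'max_spins' is the configurable ceiling. */
static void spin_tuner_init(spin_tuner *t, int max_spins)
{
    if (max_spins < MIN_SPINS_PER_DELAY)
        max_spins = MIN_SPINS_PER_DELAY;
    t->max_spins_per_delay = max_spins;
    t->spins_per_delay = MIN_SPINS_PER_DELAY;
}

/*
 * Call after each contended acquisition.
 * had_to_sleep != 0 means the spin budget ran out before we got the lock.
 */
static void spin_tuner_update(spin_tuner *t, int had_to_sleep)
{
    if (had_to_sleep)
    {
        /* Spinning did not pay off: back off slowly. */
        t->spins_per_delay -= 1;
        if (t->spins_per_delay < MIN_SPINS_PER_DELAY)
            t->spins_per_delay = MIN_SPINS_PER_DELAY;
    }
    else
    {
        /* Spinning was enough: allow more of it, up to the ceiling. */
        t->spins_per_delay += 100;
        if (t->spins_per_delay > t->max_spins_per_delay)
            t->spins_per_delay = t->max_spins_per_delay;
    }
}
```

With a ceiling that can be set per platform, a machine where long spins help could be given a high limit and one where they hurt could be given a low one, without changing the algorithm itself.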
        regards, tom lane