Thread: Spinlocks and CPU Architectures
The long history of spinlock issues has recently been attacked significantly by Tom, but I wanted to get a status on this issue before we release 8.1. My understanding of the problems of spinlocking has been greatly enhanced by two recent articles: Linux Journal, discussing Linux SMP portability issues across multiple CPU architectures http://www.linuxjournal.com/article/8211 http://www.linuxjournal.com/article/8212 A similar blog from an MS guy, who has been looking into porting the MS CLR across to multiple CPUs, rather than just x86 arch & derivatives. This article was the nearest I could find to an article I found in the most recent copy of the MS Developer Journal, that discussed multi-processor synchronisation techniques. http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx The conclusion I draw from all of this is that the spinlock code needs to be specialised for individual hardware architectures, not just instruction sets. That is, we need different code for Intel and AMD, as well as for other CPU types - *if* you want it to perform well for SMP. From the experience with Xeon and Xeon MP, it also seems to be the case that different CPU variants and implementations cause different memory access behaviours in SMP systems compared with single/dual CPUs. That's bad news for portability and for producing binaries, but we should not use that as a reason to avoid facing up to the situation. The rest of the world does seem to try quite hard to limit portability and the articles above show that others must also be experiencing the same difficulties. Do other people reach the same conclusions? Can we make a list of those architectures for which 8.1 is known to perform reasonably well, with reasonable SMP scalability? I suggest that we record this list somewhere in the release notes, but with a comment to say we run on other architectures, but they have not been SMP tested as of the date of announcement. That is important, since the release notes make specific claims about scalability features. Best Regards, Simon Riggs
> Do other people reach the same conclusions? > > Can we make a list of those architectures for which 8.1 is known to > perform reasonably well, with reasonable SMP scalability? I suggest that > we record this list somewhere in the release notes, but with a comment > to say we run on other architectures, but they have not been SMP tested > as of date of announcement. That is important, since the release notes > make specific claim about scalability features. > I posted a message last week about some tests with Tom's recent spinlock patches on a quad Opteron server running Suse 9.2. I found that the patches helped a great deal when the concurrency level was less than or equal to the number of processors. When it was greater than that, they didn't help nearly as much, and in fact at high concurrency levels the application would run about as fast on a single processor as on a quad. It was better than without the patches, but that's not what I would call good scalability on this architecture. Emil P.S. I did put it into production last week since the gain when the concurrency level was <= 4 was so pronounced and it appears to be working well.
Simon Riggs <simon@2ndquadrant.com> writes: > The long history of spinlock issues has recently been attacked > significantly by Tom, but I wanted to get a status on this issue before > we release 8.1 I'd still like to do something more with that before we release, but exactly what is TBD. > The conclusion I draw from all of this is that the spinlock code needs > to be specialised for individual hardware architectures, not just > instruction sets. That is, we need different code for Intel and AMD, as > well as for other CPU types - *if* you want it to perform well for SMP. This seems pretty unworkable from a packaging standpoint. Even if you teach autoconf how to tell which model it's running on, there's no guarantee that the resulting executables will be used on that same machine. We would have to make a run-time test, and I do not think that that idea is attractive either --- adding a conditional branch to the spinlock code will likely negate whatever performance improvement we could hope to get. As far as the x86_64 TAS code is concerned, my inclination is to remove the cmpb test; that's a significant win on Opteron and only a small loss on EM64T. There is no evidence to justify removing cmpb on the x86 architecture, but we can easily split the x86 and x86_64 cases. I'd also like to apply some form of the adaptive-spin-delay patch that was discussed last month. regards, tom lane
On Tue, Oct 11, 2005 at 11:12:46AM -0400, Tom Lane wrote: > This seems pretty unworkable from a packaging standpoint. Even if you > teach autoconf how to tell which model it's running on, there's no > guarantee that the resulting executables will be used on that same > machine. We would have to make a run-time test, and I do not think that > that idea is attractive either --- adding a conditional branch to the > spinlock code will likely negate whatever performance improvement we > could hope to get. Well, as long as the code you've got works on all the systems you expect, you have the choice. If you start getting to the point where there is no single piece of code that works across all the expected systems, then you have an issue. I don't think we're there yet, but I don't think using a function pointer would be all that expensive? Performance measuring, I guess... -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Tom Lane wrote: > This seems pretty unworkable from a packaging standpoint. Even if > you teach autoconf how to tell which model it's running on, there's > no guarantee that the resulting executables will be used on that same > machine. A number of packages in the video area (and perhaps others) do compile "sub-architecture" specific variants. This could be done for PostgreSQL, but you'd probably need to show some pretty convincing performance numbers before people start the packaging effort. -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Tue, 2005-10-11 at 18:45 +0200, Peter Eisentraut wrote: > Tom Lane wrote: > > This seems pretty unworkable from a packaging standpoint. Even if > > you teach autoconf how to tell which model it's running on, there's > > no guarantee that the resulting executables will be used on that same > > machine. > > A number of packages in the video area (and perhaps others) do compile > "sub-architecture" specific variants. This could be done for > PostgreSQL, but you'd probably need to show some pretty convincing > performance numbers before people start the packaging effort. I completely agree, just note that we already have some cases where convincing performance numbers exist. Tom is suggesting having different behaviour for x86 and x86_64. The x86 binary will still run on the x86_64 architecture, would it not? So we'll have two binaries for each OS, yes? In general, where we do find a clear difference, we should at the very least identify/record which variant the binary is most suitable for. At best we could produce different executables, but I understand the packaging effort required to do that. Best Regards, Simon Riggs
As an aside, here is a package that has recently been BSD re-licensed: http://sourceforge.net/projects/libltx/ It is a lightweight memory transaction package. It comes with a paper entitled "Cache Sensitive Software Transactional Memory" by Robert Ennals. In the paper, Robert Ennals suggests this form of concurrent programming as a replacement for lock based programming. A quote: "We have now reached the point where transactions are outperforming locks -- and people are starting to get interested." There are a number of interesting claims in the paper. Since the license is now compatible, it may have some interest for integration into the PostgreSQL core where appropriate. It would certainly be worthwhile to read the paper and fool around with the supplied test driver to compare the approaches. If nobody on the PostgreSQL team has time for the experimentation, it might be a good project for a PhD candidate at some university. > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Simon Riggs > Sent: Tuesday, October 11, 2005 10:56 AM > To: Peter Eisentraut > Cc: pgsql-hackers@postgresql.org; Tom Lane > Subject: Re: [HACKERS] Spinlocks and CPU Architectures
Simon Riggs <simon@2ndquadrant.com> writes: > On Tue, 2005-10-11 at 18:45 +0200, Peter Eisentraut wrote: >> A number of packages in the video area (and perhaps others) do compile >> "sub-architecture" specific variants. This could be done for >> PostgreSQL, but you'd probably need to show some pretty convincing >> performance numbers before people start the packaging effort. > I completely agree, just note that we already have some cases where > convincing performance numbers exist. I'm not sure the data we have is convincing enough to justify sub-architecture packaging. We have a somewhat-artificial test case that's designed to generate the maximum possible spinlock contention, and we can measure a significant difference in that context --- but that's a long way from saying that a similar percentage difference would be seen in real-world applications. It's also a long way from saying that the effort needed to set this up wouldn't be better repaid working on other problems. One thought that comes to mind is that these decisions are probably comparable to those made by gcc conditional on -march flags. Do we get access to the -march setting by means of predefined symbols? If so we could compile different TAS code for opteron and em64t without introducing any packaging issues that didn't exist already. It would essentially be up to the source-code builder which sub-arch to tune for. (Since the assembly code in question currently works only for gcc anyway, I'm not too concerned about making gcc-specific assumptions about the availability of flag macros.) regards, tom lane
On Tue, Oct 11, 2005 at 02:28:02PM -0400, Tom Lane wrote: > One thought that comes to mind is that these decisions are probably > comparable to those made by gcc conditional on -march flags. Do we > get access to the -march setting by means of predefined symbols? > If so we could compile different TAS code for opteron and em64t without > introducing any packaging issues that didn't exist already. It would > essentially be up to the source-code builder which sub-arch to tune for. The option is available; see below. It appears __tune_xxx__ is set for the -mcpu option and __xxx__ is set for the -march option. This is gcc 3.3.5, but it probably works for older versions... -march controls which instructions to use; -mcpu controls which scheduling to assume. Which do we use? The former implies the latter but not the other way round. Have a nice day,

kleptog@vali:/tmp$ rm a.h
kleptog@vali:/tmp$ touch a.h
kleptog@vali:/tmp$ gcc -mcpu=i386 -E -dM a.h >/tmp/a
kleptog@vali:/tmp$ gcc -mcpu=pentium -E -dM a.h >/tmp/b
kleptog@vali:/tmp$ diff -u0 /tmp/a /tmp/b
--- /tmp/a 2005-10-11 21:02:38.000000000 +0200
+++ /tmp/b 2005-10-11 21:02:41.000000000 +0200
@@ -28,0 +29 @@
+#define __tune_pentium__ 1
@@ -44,0 +46 @@
+#define __tune_i586__ 1
@@ -61 +62,0 @@
-#define __tune_i386__ 1
kleptog@vali:/tmp$ gcc -march=i386 -E -dM a.h >/tmp/a
kleptog@vali:/tmp$ gcc -march=athlon -E -dM a.h >/tmp/b
kleptog@vali:/tmp$ diff -u0 /tmp/a /tmp/b
--- /tmp/a 2005-10-11 21:03:57.000000000 +0200
+++ /tmp/b 2005-10-11 21:05:47.000000000 +0200
@@ -4,0 +5 @@
+#define __3dNOW_A__ 1
@@ -16,0 +18 @@
+#define __athlon 1
@@ -41,0 +44 @@
+#define __MMX__ 1
@@ -46,0 +50 @@
+#define __athlon__ 1
@@ -59 +62,0 @@
-#define __tune_i386__ 1
@@ -63,0 +67,2 @@
+#define __tune_athlon__ 1
+#define __3dNOW__ 1

-- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
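As a rough illustration of how source code could act on the predefined symbols shown in that diff output, the sketch below picks a label at compile time. Only the macro names __athlon__, __tune_athlon__, __tune_pentium__ and __tune_i386__ come from the output above; DEMO_SUBARCH and demo_subarch are hypothetical names, and which TAS body would be paired with each branch is deliberately left open.

```c
#include <assert.h>

/* Pick a sub-architecture label from gcc's predefined macros.
 * -march=athlon defines __athlon__; the __tune_*__ symbols follow the
 * tuning (scheduling) target, as seen in the diff output above. */
#if defined(__athlon__)
#define DEMO_SUBARCH "march-athlon"
#elif defined(__tune_athlon__)
#define DEMO_SUBARCH "tune-athlon"
#elif defined(__tune_pentium__)
#define DEMO_SUBARCH "tune-pentium"
#elif defined(__tune_i386__)
#define DEMO_SUBARCH "tune-i386"
#else
#define DEMO_SUBARCH "generic"   /* fallback: flags unknown or not gcc */
#endif

/* Returns the label chosen at compile time. A real header would select
 * a TAS implementation here instead of a string. */
static const char *demo_subarch(void)
{
    const char *name = DEMO_SUBARCH;
    assert(name[0] != '\0');
    return name;
}
```

Built with the same flags as in the transcript, for example gcc -march=athlon, the first branch would be taken; with no flags the result depends on the compiler's defaults, so the generic fallback matters.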
Simon Riggs wrote: > Tom is suggesting having different behaviour for x86 and x86_64. The > x86 will still run on x86_64 architecture would it not? So we'll have > two binaries for each OS, yes? A quick glance around tells me that most free operating systems are treating x86 and x86_64 as separate platforms anyway, so having different code for each is not a problem. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Martijn van Oosterhout <kleptog@svana.org> writes: > On Tue, Oct 11, 2005 at 02:28:02PM -0400, Tom Lane wrote: >> One thought that comes to mind is that these decisions are probably >> comparable to those made by gcc conditional on -march flags. Do we >> get access to the -march setting by means of predefined symbols? > The option is available see below. It appears __tune_xxx__ is set for > the -mcpu option and __xxx__ is set for the -march option. This is gcc > 3.3.5, but it probably works for older versions... Actually, after reviewing the thread from last month, I was misremembering: *all* of the test cases we had for x86_64 showed the cmpb to be a loss. It was only on plain x86 that there was some difference of results for that patch. So I think we should just remove the cmpb unconditionally for x86_64, and be done with it. We can leave the x86 case alone, at least for now, because there didn't seem to be any cases of big wins there. The reason I was confused was that the related patch to vary the spinlock delay loop count seemed to be a win on Opteron but a loss on EM64T. This probably means that we need a smarter algorithm for varying the loop count --- the upper limit has to be configurable, perhaps. regards, tom lane
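One plausible shape for a loop-count adjustment with a configurable upper limit is sketched below. This is not the patch discussed in the thread: the struct, the function name and the step sizes (+100 on success, -1 after a sleep) are assumptions chosen only to show the clamping logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tuning state, for illustration only. */
typedef struct {
    int spins_per_delay;   /* current spin count before sleeping */
    int min_spins;         /* lower bound */
    int max_spins;         /* the configurable upper limit */
} spin_tuning;

/* Call once per lock acquisition. If the lock was obtained while
 * spinning, spinning paid off, so allow more of it; if the caller had
 * to sleep, back off slowly. The result is always clamped to
 * [min_spins, max_spins]. */
static void spin_tuning_update(spin_tuning *t, bool had_to_sleep)
{
    assert(t->min_spins <= t->max_spins);
    if (had_to_sleep) {
        t->spins_per_delay -= 1;          /* assumed step size */
        if (t->spins_per_delay < t->min_spins)
            t->spins_per_delay = t->min_spins;
    } else {
        t->spins_per_delay += 100;        /* assumed step size */
        if (t->spins_per_delay > t->max_spins)
            t->spins_per_delay = t->max_spins;
    }
}
```

With max_spins exposed as a setting, a machine where long spinning hurts could be capped low, while one where it helps could be allowed to climb, which is the kind of per-platform difference the message above describes between Opteron and EM64T.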