Thread: SMP scaling
On Tue, Mar 15, 2005 at 07:00:25PM -0500, Bruce Momjian wrote:
> Oh, you have to try CVS HEAD or a nightly snapshot.  Tom made a major
> change that allows scaling in SMP environments.

Ok, I've done the tests comparing 8.0.1 against a snapshot from the 16th,
and the results are impressive.  As well as the 16-CPU Altix, I've done
comparisons on two other 4-CPU machines which previously didn't scale as
well as expected.

Clients             1     2     3     4     6     8    12    16    32    64
----------------------------------------------------------------------------
Altix pg-8.0.1     1.00  2.02  2.98  3.97  5.87  7.23  7.51  5.54  4.68  5.10
Altix pg-20050316  1.00  1.97  2.86  3.68  5.29  6.90  9.00  9.88 10.03  9.94
AMD64 pg-8.0.1     1.00  1.87  2.77  3.34  2.73  2.57  2.58  2.62
AMD64 pg-20050316  1.00  1.95  2.84  3.69  3.61  3.66  3.70  3.69
IA64  pg-8.0.1     1.00  1.97  2.91  3.82  2.91  2.92  2.94  2.98
IA64  pg-20050316  1.00  1.98  2.95  3.87  3.80  3.78  3.86  3.90

Altix == 16x 1.6GHz Itanium2, 192GB memory
AMD64 ==  4x 2.2GHz Opteron 848, 8GB memory
IA64  ==  4x 1.5GHz Itanium2, 16GB memory

The Altix still only scales up to 10x rather than 16x, but that is
probably the NUMA configuration taking effect now.  Also, this machine
isn't set up to run databases, so it only has one FC I/O card, which
means a CPU can end up being 4 hops away from the memory and disk.  As
the database is so small (8GB) relative to the machine, the data will on
average be 2 hops away.  That gives an average of 72% of the speed of
local memory, based on previous measurements of speed vs. hops.  So
getting 63% of the theoretical maximum database throughput (10.03x
scaling on 16 CPUs) is pretty good.

-Mark
Mark Rae <mrae@purplebat.com> writes:
> Ok, I've done the tests comparing 8.0.1 against a snapshot from the 16th,
> and the results are impressive.

> Clients             1     2     3     4     6     8    12    16    32    64
> ----------------------------------------------------------------------------
> Altix pg-8.0.1     1.00  2.02  2.98  3.97  5.87  7.23  7.51  5.54  4.68  5.10
> Altix pg-20050316  1.00  1.97  2.86  3.68  5.29  6.90  9.00  9.88 10.03  9.94
> AMD64 pg-8.0.1     1.00  1.87  2.77  3.34  2.73  2.57  2.58  2.62
> AMD64 pg-20050316  1.00  1.95  2.84  3.69  3.61  3.66  3.70  3.69
> IA64  pg-8.0.1     1.00  1.97  2.91  3.82  2.91  2.92  2.94  2.98
> IA64  pg-20050316  1.00  1.98  2.95  3.87  3.80  3.78  3.86  3.90

Hey, that looks pretty sweet.  One thing this obscures though is whether
there is any change in the single-client throughput rate --- ie, is
"1.00" better or worse for CVS tip vs 8.0.1?

			regards, tom lane
On Fri, Mar 18, 2005 at 10:38:24AM -0500, Tom Lane wrote:
> Hey, that looks pretty sweet.  One thing this obscures though is whether
> there is any change in the single-client throughput rate --- ie, is
> "1.00" better or worse for CVS tip vs 8.0.1?

Here are the figures in queries per second.

Clients             1     2     3     4     6     8    12    16    32    64
----------------------------------------------------------------------------
AMD64 pg-8.0.1     6.80 12.71 18.82 22.73 18.58 17.48 17.56 17.81
AMD64 pg-20050316  6.80 13.23 19.32 25.09 24.56 24.93 25.20 25.09
IA64  pg-8.0.1     3.72  7.32 10.81 14.21 10.81 10.85 10.92 11.09
IA64  pg-20050316  3.99  7.92 11.78 15.46 15.17 15.09 15.41 15.58
Altix pg-8.0.1     3.66  7.37 10.89 14.53 21.47 26.47 27.47 20.28 17.12 18.66
Altix pg-20050316  3.83  7.55 10.98 14.10 20.27 26.47 34.50 37.88 38.45 38.12

So it didn't make any difference for the Opteron, but the two Itanium
machines were faster at a single client: about 5% for the Altix and 7%
for the IA64.

-Mark
Mark Rae wrote:
> On Fri, Mar 18, 2005 at 10:38:24AM -0500, Tom Lane wrote:
> > Hey, that looks pretty sweet.  One thing this obscures though is whether
> > there is any change in the single-client throughput rate --- ie, is
> > "1.00" better or worse for CVS tip vs 8.0.1?
>
> Here are the figures in queries per second.
>
> Clients             1     2     3     4     6     8    12    16    32    64
> ----------------------------------------------------------------------------
> AMD64 pg-8.0.1     6.80 12.71 18.82 22.73 18.58 17.48 17.56 17.81
> AMD64 pg-20050316  6.80 13.23 19.32 25.09 24.56 24.93 25.20 25.09
> IA64  pg-8.0.1     3.72  7.32 10.81 14.21 10.81 10.85 10.92 11.09
> IA64  pg-20050316  3.99  7.92 11.78 15.46 15.17 15.09 15.41 15.58
> Altix pg-8.0.1     3.66  7.37 10.89 14.53 21.47 26.47 27.47 20.28 17.12 18.66
> Altix pg-20050316  3.83  7.55 10.98 14.10 20.27 26.47 34.50 37.88 38.45 38.12
>
> So it didn't make any difference for the Opteron, but the two Itanium
> machines were faster at a single client: about 5% for the Altix and 7%
> for the IA64.

So it seems our entire SMP problem was that global lock.  Nice.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> So it seems our entire SMP problem was that global lock.  Nice.

Yeah, I was kind of expecting to see the LockMgrLock up next, but it
seems we're still a ways away from having a problem there.  I guess
that's because we only tend to touch locks once per query, whereas
we're grabbing and releasing buffers much more.

From the relatively small absolute value of Mark's queries/sec numbers,
I suppose he is testing some fairly heavyweight queries (big enough to
not emphasize per-query overhead).  I wonder what the numbers would
look like with very small, simple queries.  It'd move the stress around
for sure ...

			regards, tom lane
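One way to try the small-query experiment is a bare libpq timing loop
along the lines of the sketch below.  It is only an illustration, not
code from the thread: the connection string and the trivial SELECT are
placeholders, and contrib/pgbench is the more usual tool for this sort
of measurement.  Running several copies concurrently would reproduce
the multi-client columns above.

/*
 * Sketch: hammer the server with trivial queries and report
 * queries/second.  Build with something like
 *   cc tinyq.c -I$(pg_config --includedir) -lpq -o tinyq
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "libpq-fe.h"

int
main(int argc, char **argv)
{
	const char *conninfo = (argc > 1) ? argv[1] : "dbname=test";	/* placeholder */
	const int	nqueries = 100000;
	PGconn	   *conn = PQconnectdb(conninfo);
	struct timeval start, end;
	double		elapsed;
	int			i;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	gettimeofday(&start, NULL);
	for (i = 0; i < nqueries; i++)
	{
		/* About as small and simple as a query gets. */
		PGresult   *res = PQexec(conn, "SELECT 1");

		if (PQresultStatus(res) != PGRES_TUPLES_OK)
		{
			fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
			PQclear(res);
			break;
		}
		PQclear(res);
	}
	gettimeofday(&end, NULL);

	elapsed = (end.tv_sec - start.tv_sec) +
			  (end.tv_usec - start.tv_usec) / 1e6;
	printf("%d queries in %.2f s = %.0f queries/sec\n",
		   i, elapsed, i / elapsed);

	PQfinish(conn);
	return 0;
}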
Mark Rae <mrae@purplebat.com> writes:
> The Altix still only scales up to 10x rather than 16x, but that is
> probably the NUMA configuration taking effect now.

BTW, although I know next to nothing about NUMA, I do know that it is
configurable to some extent (eg, via numactl).  What was the
configuration here exactly, and did you try alternatives?  Also, what
was the OS exactly?  (I've heard that RHEL4 is a whole lot better than
RHEL3 in managing NUMA, for example.  This may be generic to 2.6 vs 2.4
Linux kernels, or maybe Red Hat did some extra hacking.)

			regards, tom lane
On Fri, Mar 18, 2005 at 01:31:51PM -0500, Tom Lane wrote:
> BTW, although I know next to nothing about NUMA, I do know that it is
> configurable to some extent (eg, via numactl).  What was the
> configuration here exactly, and did you try alternatives?  Also, what
> was the OS exactly?  (I've heard that RHEL4 is a whole lot better than
> RHEL3 in managing NUMA, for example.  This may be generic to 2.6 vs 2.4
> Linux kernels, or maybe Red Hat did some extra hacking.)

The Altix uses a 2.4.21 kernel with SGI's own modifications to support
up to 256 CPUs and their NUMAlink hardware.  (Some of this has since
become the NUMA code in the 2.6 kernel.)

Even with the NUMA support, which makes sure any memory allocated by
malloc or on the stack ends up local to the processor which originally
asked for it, and then keeps scheduling the process on that CPU, there
is still the problem that all table accesses* go through the shared
buffer cache, which resides in one location.

[* is this true in all cases?]

I was about to write a long explanation of how the only way to scale
out to this size would be to have separate buffer caches in each memory
domain, which would then require some kind of cache-coherency mechanism.
But after reading a few bits of documentation, it looks like SGI already
has a solution in the form of symmetric data objects.  In particular,
the symmetric heap is an area of shared memory which is replicated
across all memory domains, with the coherency being handled in hardware.

So it looks like all that might be needed is to replace the shmget
calls in src/backend/port with the equivalent SGI functions.

-Mark
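A rough illustration of the shape of that change might look like the
sketch below.  It is hypothetical: sgi_symmetric_alloc() is a made-up
placeholder for whatever SGI's symmetric-heap API actually provides,
and the SysV branch only loosely follows what
src/backend/port/sysv_shmem.c does today, minus all the error recovery
and collision handling.

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

static void *
create_shared_segment(size_t size)
{
#ifdef USE_SGI_SYMMETRIC_HEAP
	/* Placeholder call; the real name and signature would come from SGI's docs. */
	void	   *addr = sgi_symmetric_alloc(size);

	if (addr == NULL)
	{
		perror("sgi_symmetric_alloc");
		exit(1);
	}
	return addr;
#else
	/* Plain SysV shared memory, as the backend uses now. */
	int			shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | IPC_EXCL | 0600);
	void	   *addr;

	if (shmid < 0)
	{
		perror("shmget");
		exit(1);
	}
	addr = shmat(shmid, NULL, 0);
	if (addr == (void *) -1)
	{
		perror("shmat");
		exit(1);
	}
	/* Demo housekeeping: destroy the segment once everyone detaches. */
	shmctl(shmid, IPC_RMID, NULL);
	return addr;
#endif
}

int
main(void)
{
	/* Grab a 64MB segment, just to show the call shape. */
	void	   *base = create_shared_segment(64 * 1024 * 1024);

	printf("segment attached at %p\n", base);
	return 0;
}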
Mark Rae <mrae@purplebat.com> writes:
> Even with the NUMA support, which makes sure any memory allocated by
> malloc or on the stack ends up local to the processor which originally
> asked for it, and then keeps scheduling the process on that CPU, there
> is still the problem that all table accesses* go through the shared
> buffer cache, which resides in one location.
> [* is this true in all cases?]

Temp tables are handled in backend-local memory, but all normal tables
have to be accessed through shared buffers.  The implications of not
doing that are bad enough that it's hard to believe it could be a win
to change.  (In short: the hardware may not like syncing across
processors, but it can still do it faster than we could hope to do in
software.)

> It looks like SGI already has a solution in the form of symmetric
> data objects.  In particular, the symmetric heap is an area of shared
> memory which is replicated across all memory domains, with the
> coherency being handled in hardware.

Hmm, do they support spinlock objects in this memory?  If so it could
be just the right thing.

> So it looks like all that might be needed is to replace the shmget
> calls in src/backend/port with the equivalent SGI functions.

Send a patch ;-)

			regards, tom lane
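One quick standalone way to answer the spinlock question for a given
kind of shared segment is a test along these lines.  It is only a
sketch: the GCC __sync builtins stand in for PostgreSQL's
platform-specific TAS in s_lock.h, and the shmget/shmat pair would be
swapped for the SGI allocator to exercise the symmetric heap.  If the
final counter comes up short of the expected total, atomic test-and-set
is not behaving in that memory.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>

typedef struct
{
	volatile int lock;			/* 0 = free, 1 = held */
	long		counter;
} SharedTest;

static void
spin_acquire(volatile int *lock)
{
	while (__sync_lock_test_and_set(lock, 1))
		;						/* busy-wait; real code would back off */
}

static void
spin_release(volatile int *lock)
{
	__sync_lock_release(lock);
}

int
main(void)
{
	const int	nworkers = 4;
	const long	iterations = 1000000;
	int			shmid = shmget(IPC_PRIVATE, sizeof(SharedTest), IPC_CREAT | 0600);
	SharedTest *st;
	int			i;

	if (shmid < 0)
	{
		perror("shmget");
		return 1;
	}
	st = (SharedTest *) shmat(shmid, NULL, 0);
	if (st == (SharedTest *) -1)
	{
		perror("shmat");
		return 1;
	}
	st->lock = 0;
	st->counter = 0;

	for (i = 0; i < nworkers; i++)
	{
		if (fork() == 0)
		{
			long		j;

			for (j = 0; j < iterations; j++)
			{
				spin_acquire(&st->lock);
				st->counter++;
				spin_release(&st->lock);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;

	/* Expect nworkers * iterations if the lock really is atomic here. */
	printf("counter = %ld (expected %ld)\n",
		   st->counter, (long) nworkers * iterations);
	shmctl(shmid, IPC_RMID, NULL);
	return 0;
}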