Re: spinlocks on HP-UX - Mailing list pgsql-hackers

From:           Tom Lane
Subject:        Re: spinlocks on HP-UX
Date:
Msg-id:         22039.1314573597@sss.pgh.pa.us
In response to: Re: spinlocks on HP-UX (Tom Lane <tgl@sss.pgh.pa.us>)
Responses:      Re: spinlocks on HP-UX
List:           pgsql-hackers
I wrote:
> Yeah, I figured out that was probably what you meant a little while
> later.  I found a 64-CPU IA64 machine in Red Hat's test labs and am
> currently trying to replicate your results; report to follow.

OK, these results are on a 64-processor SGI IA64 machine (AFAICT, 64
independent sockets, no hyperthreading or any funny business); 124GB in
32 NUMA nodes; running RHEL5.7, gcc 4.1.2.  I built today's git head
with --enable-debug (but not --enable-cassert) and ran with all default
configuration settings except shared_buffers = 8GB and
max_connections = 200.  The test database is initialized at -s 100.
I did not change the database between runs, but restarted the
postmaster and then did this to warm the caches a tad:

	pgbench -c 1 -j 1 -S -T 30 bench

Per-run pgbench parameters are as shown below --- note in particular
that I assigned one pgbench thread per 8 backends.  The numbers are
fairly variable even with 5-minute runs; I did each series twice so you
could get a feeling for how much.

Today's git head:

pgbench -c 1 -j 1 -S -T 300 bench	tps = 5835.213934 (including ...
pgbench -c 2 -j 1 -S -T 300 bench	tps = 8499.223161 (including ...
pgbench -c 8 -j 1 -S -T 300 bench	tps = 15197.126952 (including ...
pgbench -c 16 -j 2 -S -T 300 bench	tps = 30803.255561 (including ...
pgbench -c 32 -j 4 -S -T 300 bench	tps = 65795.356797 (including ...
pgbench -c 64 -j 8 -S -T 300 bench	tps = 81644.914241 (including ...
pgbench -c 96 -j 12 -S -T 300 bench	tps = 40059.202836 (including ...
pgbench -c 128 -j 16 -S -T 300 bench	tps = 21309.615001 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench	tps = 5787.310115 (including ...
pgbench -c 2 -j 1 -S -T 300 bench	tps = 8747.104236 (including ...
pgbench -c 8 -j 1 -S -T 300 bench	tps = 14655.369995 (including ...
pgbench -c 16 -j 2 -S -T 300 bench	tps = 28287.254924 (including ...
pgbench -c 32 -j 4 -S -T 300 bench	tps = 61614.715187 (including ...
pgbench -c 64 -j 8 -S -T 300 bench	tps = 79754.640518 (including ...
pgbench -c 96 -j 12 -S -T 300 bench	tps = 40334.994324 (including ...
pgbench -c 128 -j 16 -S -T 300 bench	tps = 23285.271257 (including ...

With modified TAS macro (see patch 1 below):

pgbench -c 1 -j 1 -S -T 300 bench	tps = 6171.454468 (including ...
pgbench -c 2 -j 1 -S -T 300 bench	tps = 8709.003728 (including ...
pgbench -c 8 -j 1 -S -T 300 bench	tps = 14902.731035 (including ...
pgbench -c 16 -j 2 -S -T 300 bench	tps = 29789.744482 (including ...
pgbench -c 32 -j 4 -S -T 300 bench	tps = 59991.549128 (including ...
pgbench -c 64 -j 8 -S -T 300 bench	tps = 117369.287466 (including ...
pgbench -c 96 -j 12 -S -T 300 bench	tps = 112583.144495 (including ...
pgbench -c 128 -j 16 -S -T 300 bench	tps = 110231.305282 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench	tps = 5670.097936 (including ...
pgbench -c 2 -j 1 -S -T 300 bench	tps = 8230.786940 (including ...
pgbench -c 8 -j 1 -S -T 300 bench	tps = 14785.952481 (including ...
pgbench -c 16 -j 2 -S -T 300 bench	tps = 29335.875139 (including ...
pgbench -c 32 -j 4 -S -T 300 bench	tps = 59605.433837 (including ...
pgbench -c 64 -j 8 -S -T 300 bench	tps = 108884.294519 (including ...
pgbench -c 96 -j 12 -S -T 300 bench	tps = 110387.439978 (including ...
pgbench -c 128 -j 16 -S -T 300 bench	tps = 109046.121191 (including ...

With unlocked test in s_lock.c delay loop only (patch 2 below):

pgbench -c 1 -j 1 -S -T 300 bench	tps = 5426.491088 (including ...
pgbench -c 2 -j 1 -S -T 300 bench	tps = 8787.939425 (including ...
pgbench -c 8 -j 1 -S -T 300 bench	tps = 15720.801359 (including ...
pgbench -c 16 -j 2 -S -T 300 bench	tps = 33711.102718 (including ...
pgbench -c 32 -j 4 -S -T 300 bench	tps = 61829.180234 (including ...
pgbench -c 64 -j 8 -S -T 300 bench	tps = 109781.655020 (including ...
pgbench -c 96 -j 12 -S -T 300 bench	tps = 107132.848280 (including ...
pgbench -c 128 -j 16 -S -T 300 bench	tps = 106533.630986 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench	tps = 5705.283316 (including ...
pgbench -c 2 -j 1 -S -T 300 bench	tps = 8442.798662 (including ...
pgbench -c 8 -j 1 -S -T 300 bench	tps = 14423.723837 (including ...
pgbench -c 16 -j 2 -S -T 300 bench	tps = 29112.751995 (including ...
pgbench -c 32 -j 4 -S -T 300 bench	tps = 62258.984033 (including ...
pgbench -c 64 -j 8 -S -T 300 bench	tps = 107741.988800 (including ...
pgbench -c 96 -j 12 -S -T 300 bench	tps = 107138.968981 (including ...
pgbench -c 128 -j 16 -S -T 300 bench	tps = 106110.215138 (including ...

So this pretty well confirms Robert's results, in particular that all
of the win from an unlocked test comes from using it in the delay loop.
Given the lack of evidence that a general change in TAS() is
beneficial, I'm inclined to vote against it, on the grounds that the
extra test is surely a loss at some level when there is not contention.
(IOW, +1 for inventing a second macro to use in the delay loop only.)

We ought to do similar tests on other architectures.  I found some
lots-o-processors x86_64 machines at Red Hat, but they don't seem to
own any PPC systems with more than 8 processors.  Anybody have big iron
with other non-Intel chips?

			regards, tom lane


Patch 1: change TAS globally, non-HPUX code:

*** src/include/storage/s_lock.h.orig	Sat Jan  1 13:27:24 2011
--- src/include/storage/s_lock.h	Sun Aug 28 13:32:47 2011
***************
*** 228,233 ****
--- 228,240 ----
  {
  	long int	ret;
  
+ 	/*
+ 	 * Use a non-locking test before the locking instruction proper.  This
+ 	 * appears to be a very significant win on many-core IA64.
+ 	 */
+ 	if (*lock)
+ 		return 1;
+ 
  	__asm__ __volatile__(
  		"	xchg4 	%0=%1,%2	\n"
  :		"=r"(ret), "+m"(*lock)
***************
*** 243,248 ****
--- 250,262 ----
  {
  	int		ret;
  
+ 	/*
+ 	 * Use a non-locking test before the locking instruction proper.  This
+ 	 * appears to be a very significant win on many-core IA64.
+ 	 */
+ 	if (*lock)
+ 		return 1;
+ 
  	ret = _InterlockedExchange(lock,1);	/* this is a xchg asm macro */
  	return ret;

Patch 2: change s_lock only (same as Robert's quick hack):

*** src/backend/storage/lmgr/s_lock.c.orig	Sat Jan  1 13:27:09 2011
--- src/backend/storage/lmgr/s_lock.c	Sun Aug 28 14:02:29 2011
***************
*** 96,102 ****
  	int			delays = 0;
  	int			cur_delay = 0;
  
! 	while (TAS(lock))
  	{
  		/* CPU-specific delay each time through the loop */
  		SPIN_DELAY();
--- 96,102 ----
  	int			delays = 0;
  	int			cur_delay = 0;
  
! 	while (*lock ? 1 : TAS(lock))
  	{
  		/* CPU-specific delay each time through the loop */
  		SPIN_DELAY();