Thread: Add spin_delay() implementation for Arm in s_lock.h

Add spin_delay() implementation for Arm in s_lock.h

From
"Blake, Geoff"
Date:
Hi,

Have a tiny patch to add an implementation of spin_delay() for Arm64 processors to match the behavior of x86's PAUSE
instruction. I see negligible benefit on the pgbench tpcb-like workload, so at worst it appears to do no harm, but it
should help some workloads that experience lock contention and need to spin.
 

Thanks,
Geoffrey Blake


Attachment

Re: Add spin_delay() implementation for Arm in s_lock.h

From
Tom Lane
Date:
"Blake, Geoff" <blakgeof@amazon.com> writes:
> Have a tiny patch to add an implementation of spin_delay() for Arm64 processors to match the behavior of x86's PAUSE
> instruction. I see negligible benefit on the pgbench tpcb-like workload, so at worst it appears to do no harm, but it
> should help some workloads that experience lock contention and need to spin.

Given the very wide variety of ARM implementations out there,
I'm not sure that we want to take a patch like this on the basis of
exactly zero evidence.  It could as easily be a net loss as a win.
What did you test exactly?

            regards, tom lane



Re: Add spin_delay() implementation for Arm in s_lock.h

From
"Blake, Geoff"
Date:
Hi Tom,

> What did you test exactly?

Tested 3 benchmark configurations on an m6g.16xlarge (Graviton2, 64 cpus, 256GB RAM)
I set the scale factor to consume about 1/3 of 256GB and the other parameters in the next line.
pgbench setup: -F 90 -s 5622 -c 256

pgbench select-only  w/  patch: 662804 tps (-0.5%)
                     w/o patch: 666354 tps
pgbench tpcb-like    w/  patch:  35844 tps (0%)
                     w/o patch:  35835 tps

We also test with HammerDB when evaluating patches; it shows the patch gaining +3%:
HammerDB (192 warehouses, 256 clients)
w/   patch 1147463 NOPM (+3%)
w/o patch 1112908 NOPM

I've run pgbench more than once and the measured TPS values overlap; even though the means on select-only show a small
degradation, at the moment I am concluding it is noise.  On HammerDB, the results show a measurable difference.
 

Thanks,
Geoff




Re: Add spin_delay() implementation for Arm in s_lock.h

From
"Blake, Geoff"
Date:
Tom,

Hope everything is well going into the new year.  I'd like to pick this discussion back up and get your thoughts on the
patch with the data I posted 2 weeks prior.  Is there more data that would be helpful?  A different setup?  Data on older
versions of PostgreSQL to ascertain whether it makes more sense on versions before the large rework of the snapshot
algorithm that exhibited quite a bit of synchronization contention?
 

Thanks,
Geoff



Re: Add spin_delay() implementation for Arm in s_lock.h

From
Tom Lane
Date:
"Blake, Geoff" <blakgeof@amazon.com> writes:
> Hope everything is well going into the new year.  I'd like to pick this discussion back up and get your thoughts on the
> patch with the data I posted 2 weeks prior.  Is there more data that would be helpful?  A different setup?  Data on older
> versions of PostgreSQL to ascertain whether it makes more sense on versions before the large rework of the snapshot
> algorithm that exhibited quite a bit of synchronization contention?

I spent some time working on this.  I don't have a lot of faith in
pgbench as a measurement testbed for spinlock contention, because over
the years we've done a good job of getting rid of that in our main
code paths (both the specific change you mention, and many others).
After casting around a bit and thinking about writing a bespoke test
framework, I landed on the idea of adding some intentional spinlock
contention to src/test/modules/test_shm_mq, which is a prefab test
framework for passing data among multiple worker processes.  The
attached quick-hack patch makes it grab and release a spinlock once
per passed message.  I'd initially expected that this would show only
marginal changes, because you'd hope that a spinlock acquisition would
be reasonably cheap compared to shm_mq_receive plus shm_mq_send.
Turns out not though.

The proposed test case is

(1) patch test_shm_mq as below

(2) time this query:

SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, n);

for various values of "n" up to about how many cores you have.
(You'll probably need to bump up max_worker_processes.)

For context, on my Intel-based main workstation (8-core Xeon W-2245),
the time to do this with stock test_shm_mq is fairly level.
Reporting best-of-3 runs:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1386.413 ms (00:01.386)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 1302.503 ms (00:01.303)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 1373.121 ms (00:01.373)

However, after applying the contention patch:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1346.362 ms (00:01.346)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3313.490 ms (00:03.313)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 7660.329 ms (00:07.660)

So this seems like (a) it's a plausible model for code that has
unoptimized spinlock contention, and (b) the effects are large
enough that you needn't fret too much about measurement noise.

I tried this out on a handy Apple M1 mini, which I concede
is not big iron but it's pretty decent aarch64 hardware.
With current HEAD's spinlock code, I get (again best-of-3):

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1630.255 ms (00:01.630)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3495.066 ms (00:03.495)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 19541.929 ms (00:19.542)

With your spin-delay patch:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1643.524 ms (00:01.644)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3404.625 ms (00:03.405)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 19260.721 ms (00:19.261)

So I don't see a lot of reason to think your patch changes anything.
Maybe on something with more cores?

For grins I also tried this same test with the use-CAS-for-TAS patch
that was being discussed November before last, and it didn't
really show up as any improvement either:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1608.642 ms (00:01.609)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3396.564 ms (00:03.397)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 20092.683 ms (00:20.093)

Maybe that's a little better in the uncontended (single-worker)
case, but it's worse at the high end.

I'm really curious to hear if this measurement method shows
any interesting improvements on your hardware.

            regards, tom lane

diff --git a/src/test/modules/test_shm_mq/setup.c b/src/test/modules/test_shm_mq/setup.c
index e05e97c6de..d535dbe408 100644
--- a/src/test/modules/test_shm_mq/setup.c
+++ b/src/test/modules/test_shm_mq/setup.c
@@ -135,6 +135,7 @@ setup_dynamic_shared_memory(int64 queue_size, int nworkers,
     hdr->workers_total = nworkers;
     hdr->workers_attached = 0;
     hdr->workers_ready = 0;
+    hdr->messages_xfred = 0;
     shm_toc_insert(toc, 0, hdr);

     /* Set up one message queue per worker, plus one. */
diff --git a/src/test/modules/test_shm_mq/test_shm_mq.h b/src/test/modules/test_shm_mq/test_shm_mq.h
index a666121834..ff35211811 100644
--- a/src/test/modules/test_shm_mq/test_shm_mq.h
+++ b/src/test/modules/test_shm_mq/test_shm_mq.h
@@ -32,6 +32,7 @@ typedef struct
     int            workers_total;
     int            workers_attached;
     int            workers_ready;
+    uint64        messages_xfred;
 } test_shm_mq_header;

 /* Set up dynamic shared memory and background workers for test run. */
diff --git a/src/test/modules/test_shm_mq/worker.c b/src/test/modules/test_shm_mq/worker.c
index 9b037b98fe..5aceb8ca73 100644
--- a/src/test/modules/test_shm_mq/worker.c
+++ b/src/test/modules/test_shm_mq/worker.c
@@ -31,7 +31,8 @@
 static void attach_to_queues(dsm_segment *seg, shm_toc *toc,
                              int myworkernumber, shm_mq_handle **inqhp,
                              shm_mq_handle **outqhp);
-static void copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh);
+static void copy_messages(volatile test_shm_mq_header *hdr,
+                          shm_mq_handle *inqh, shm_mq_handle *outqh);

 /*
  * Background worker entrypoint.
@@ -131,7 +132,7 @@ test_shm_mq_main(Datum main_arg)
     SetLatch(&registrant->procLatch);

     /* Do the work. */
-    copy_messages(inqh, outqh);
+    copy_messages(hdr, inqh, outqh);

     /*
      * We're done.  For cleanliness, explicitly detach from the shared memory
@@ -173,7 +174,8 @@ attach_to_queues(dsm_segment *seg, shm_toc *toc, int myworkernumber,
  * after this point is cleanup.
  */
 static void
-copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh)
+copy_messages(volatile test_shm_mq_header *hdr,
+              shm_mq_handle *inqh, shm_mq_handle *outqh)
 {
     Size        len;
     void       *data;
@@ -189,7 +191,12 @@ copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh)
         if (res != SHM_MQ_SUCCESS)
             break;

-        /* Send it back out. */
+        /* Create some contention on the mutex. */
+        SpinLockAcquire(&hdr->mutex);
+        ++hdr->messages_xfred;
+        SpinLockRelease(&hdr->mutex);
+
+        /* Send the message back out. */
         res = shm_mq_send(outqh, len, data, false, true);
         if (res != SHM_MQ_SUCCESS)
             break;

Re: Add spin_delay() implementation for Arm in s_lock.h

From
Andres Freund
Date:
Hi,

> I landed on the idea of adding some intentional spinlock
> contention to src/test/modules/test_shm_mq, which is a prefab test
> framework for passing data among multiple worker processes.  The
> attached quick-hack patch makes it grab and release a spinlock once
> per passed message.

I wonder if this will show the full set of spinlock contention issues - isn't
this only causing contention for one spinlock between two processes? It's not
too hard to imagine delays being more important the more processes contend for
one cacheline.  I only skimmed your changes, so I might also just have
misunderstood what you were doing...

Greetings,

Andres Freund



Re: Add spin_delay() implementation for Arm in s_lock.h

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
>> I landed on the idea of adding some intentional spinlock
>> contention to src/test/modules/test_shm_mq, which is a prefab test
>> framework for passing data among multiple worker processes.  The
>> attached quick-hack patch makes it grab and release a spinlock once
>> per passed message.

> I wonder if this will show the full set of spinlock contention issues - isn't
> this only causing contention for one spinlock between two processes?

I don't think so -- the point of using the "pipelined" variant is
that messages are passing between all N worker processes concurrently.
(With the proposed test, I see N processes all pinning their CPUs;
if I use the non-pipelined API, they are busy but nowhere near 100%.)

It is just one spinlock, true, but I think the point is to gauge
what happens with N processes all contending for the same lock.
We could add some more complexity to use multiple locks, but
does that really add anything but complexity?

            regards, tom lane



Re: Add spin_delay() implementation for Arm in s_lock.h

From
Andres Freund
Date:
Hi,

On 2022-01-06 21:39:57 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I wonder if this will show the full set of spinlock contention issues - isn't
> > this only causing contention for one spinlock between two processes?
> 
> I don't think so -- the point of using the "pipelined" variant is
> that messages are passing between all N worker processes concurrently.
> (With the proposed test, I see N processes all pinning their CPUs;
> if I use the non-pipelined API, they are busy but nowhere near 100%.)

My understanding of the shm_mq code is that that ends up with N shm_mq
instances, one for each worker. After all:

> * shm_mq.c
> *      single-reader, single-writer shared memory message queue


These separate shm_mq instances forward messages in a circle,
"leader"->worker_1->worker_2->...->"leader". So there isn't a single contended
spinlock, but a bunch of different spinlocks, each with at most two backends
accessing it?


> It is just one spinlock, true, but I think the point is to gauge
> what happens with N processes all contending for the same lock.
> We could add some more complexity to use multiple locks, but
> does that really add anything but complexity?

Right, I agree that that's what we should test - it's just not immediately
obvious to me that we are.

Greetings,

Andres Freund



Re: Add spin_delay() implementation for Arm in s_lock.h

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> These separate shm_mq instances forward messages in a circle,
> "leader"->worker_1->worker_2->...->"leader". So there isn't a single contended
> spinlock, but a bunch of different spinlocks, each with at most two backends
> accessing it?

No; there's just one spinlock.  I'm re-purposing the spinlock that
test_shm_mq uses to protect its setup operations (and thereafter
ignores).  AFAICS the N+1 shm_mq instances don't internally contain
spinlocks; they all use atomic ops.

(Well, on crappy architectures maybe there's spinlocks underneath
the atomic ops, but I don't think we care about such cases here.)

            regards, tom lane



Re: Add spin_delay() implementation for Arm in s_lock.h

From
Andres Freund
Date:
On 2022-01-06 22:23:38 -0500, Tom Lane wrote:
> No; there's just one spinlock.  I'm re-purposing the spinlock that
> test_shm_mq uses to protect its setup operations (and thereafter
> ignores).

Oh, sorry, misread :(


> AFAICS the N+1 shm_mq instances don't internally contain
> spinlocks; they all use atomic ops.

They contain spinlocks too, and the naming is similar enough that I got
confused:
struct shm_mq
{
    slock_t        mq_mutex;

We don't use them for all that much anymore though...

Greetings,

Andres Freund



Re: Add spin_delay() implementation for Arm in s_lock.h

From
"Blake, Geoff"
Date:
Tom, Andres,

I spun up a 64-core Graviton2 instance (where I reported seeing improvement with this patch) and ran the provided
regression test with and without my proposed patch on top of mainline PG.  I ran 4 runs each with 63 workers, where we
should see the most contention and the most impact from the patch.  I am reporting the average and standard deviation;
the average with the patch is 10% lower latency, but there is overlap in the standard deviations.  I'll gather additional
data at lower worker counts and post later to see what the trend is.
 

Cmd: postgres=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, workers);

Avg +/- standard dev
63 workers w/o patch: 552443 ms +/- 22841 ms
63 workers w/  patch: 502727 ms +/- 45253 ms

Best results
w/o patch: 521216ms
w/   patch: 436442ms

Thanks,
Geoff



Re: Add spin_delay() implementation for Arm in s_lock.h

From
"Blake, Geoff"
Date:
As promised, here is the remaining data:

1 worker, w/o patch: 5236 ms +/- 252ms
1 worker, w/   patch: 5529 ms +/- 168ms

2 worker, w/o patch: 4917 ms +/- 180ms
2 worker, w/   patch: 4745 ms +/- 169ms

4 worker, w/o patch: 6564 ms +/- 336ms
4 worker, w/   patch: 6105 ms +/- 177ms

8 worker, w/o patch: 9575 ms +/- 2375ms
8 worker, w/   patch: 8115 ms +/- 391ms

16 worker, w/o patch: 19367 ms +/- 3543ms
16 worker, w/   patch: 18004 ms +/- 3701ms

32 worker, w/o patch: 101509 ms +/- 22651ms
32 worker, w/   patch: 104234 ms +/- 26821ms

48 worker, w/o patch: 243329 ms +/- 70037ms
48 worker, w/   patch: 189965 ms +/- 79459ms

64 worker, w/o patch: 552443 ms +/- 22841ms
64 worker, w/   patch: 502727 ms +/- 45253ms

From this data, on average the patch is beneficial at the high worker (CPU) counts tested: 48, 63.  At 32 and below, the
results are relatively close to each other.
 

Thanks,
Geoff


Re: Add spin_delay() implementation for Arm in s_lock.h

From
"Blake, Geoff"
Date:
Hi Tom, Andres,

Any additional feedback for this patch?

Thanks,
Geoff Blake


Re: Add spin_delay() implementation for Arm in s_lock.h

From
Tom Lane
Date:
"Blake, Geoff" <blakgeof@amazon.com> writes:
> Hi Tom, Andres,
> Any additional feedback for this patch?

I did some more research and testing:

* Using a Mac with the M1 Pro chip (marginally beefier than the M1
I was testing on before), I think I can see some benefit in the
test case I proposed upthread.  It's marginal though.

* On a Raspberry Pi 3B+, there's no outside-the-noise difference.

* ISB doesn't exist in pre-V7 ARM, so it seems prudent to restrict
the patch to ARM64.  I doubt any flavor of ARM32 would be able to
benefit anyway.  (Googling finds that MariaDB made this same
choice not long ago [1].)

So what we've got is that there seems to be benefit at high
core counts, and it at least doesn't hurt at lower ones.
That's good enough for me, so pushed.

            regards, tom lane

[1] https://jira.mariadb.org/browse/MDEV-25807



Re: Add spin_delay() implementation for Arm in s_lock.h

From
"Blake, Geoff"
Date:
Thanks for all the help Tom!

On 4/6/22, 6:07 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
