Re: Add spin_delay() implementation for Arm in s_lock.h - Mailing list pgsql-hackers

From: Tom Lane
Subject: Re: Add spin_delay() implementation for Arm in s_lock.h
Date:
Msg-id: 1269414.1641521533@sss.pgh.pa.us
In response to: Re: Add spin_delay() implementation for Arm in s_lock.h  ("Blake, Geoff" <blakgeof@amazon.com>)
Responses: Re: Add spin_delay() implementation for Arm in s_lock.h
List: pgsql-hackers
"Blake, Geoff" <blakgeof@amazon.com> writes:
> Hope everything is well going into the new year.  I'd like to pick this
> discussion back up and get your thoughts on the patch with the data I
> posted 2 weeks prior.  Is there more data that would be helpful?
> Different setup?  Data on older versions of Postgresql to ascertain if it
> makes more sense on versions before the large re-work of the snapshot
> algorithm that exhibited quite a bit of synchronization contention?

I spent some time working on this.  I don't have a lot of faith in
pgbench as a measurement testbed for spinlock contention, because over
the years we've done a good job of getting rid of that in our main
code paths (both the specific change you mention, and many others).
After casting around a bit and thinking about writing a bespoke test
framework, I landed on the idea of adding some intentional spinlock
contention to src/test/modules/test_shm_mq, which is a prefab test
framework for passing data among multiple worker processes.  The
attached quick-hack patch makes it grab and release a spinlock once
per passed message.  I'd initially expected that this would show only
marginal changes, because you'd hope that a spinlock acquisition would
be reasonably cheap compared to shm_mq_receive plus shm_mq_send.
Turns out not though.

The proposed test case is

(1) patch test_shm_mq as below

(2) time this query:

SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, n);

for various values of "n" up to about how many cores you have.
(You'll probably need to bump up max_worker_processes.)

For context, on my Intel-based main workstation (8-core Xeon W-2245),
the time to do this with stock test_shm_mq is fairly level.
Reporting best-of-3 runs:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1386.413 ms (00:01.386)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 1302.503 ms (00:01.303)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 1373.121 ms (00:01.373)

However, after applying the contention patch:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1346.362 ms (00:01.346)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3313.490 ms (00:03.313)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 7660.329 ms (00:07.660)

So this seems like (a) it's a plausible model for code that has
unoptimized spinlock contention, and (b) the effects are large
enough that you needn't fret too much about measurement noise.

I tried this out on a handy Apple M1 mini, which I concede
is not big iron but it's pretty decent aarch64 hardware.
With current HEAD's spinlock code, I get (again best-of-3):

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1630.255 ms (00:01.630)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3495.066 ms (00:03.495)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 19541.929 ms (00:19.542)

With your spin-delay patch:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1643.524 ms (00:01.644)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3404.625 ms (00:03.405)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 19260.721 ms (00:19.261)

So I don't see a lot of reason to think your patch changes anything.
Maybe on something with more cores?
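
(For anyone following along without the patch in front of them: the usual
shape of such a change is a cheap core-yield hint in s_lock.h's SPIN_DELAY
slot.  The sketch below is only illustrative -- it may not match the exact
patch being tested here.)

/*
 * Illustrative sketch, not the patch under test: an aarch64 SPIN_DELAY
 * hint, so a spinning core backs off briefly between retries instead of
 * hammering the contended cache line.
 */
#if defined(__aarch64__)
#define SPIN_DELAY() spin_delay()

static __inline__ void
spin_delay(void)
{
    __asm__ __volatile__(" isb \n");
}
#endif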

For grins I also tried this same test with the use-CAS-for-TAS patch
that was being discussed November before last, and it didn't
really show up as any improvement either:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1608.642 ms (00:01.609)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3396.564 ms (00:03.397)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 20092.683 ms (00:20.093)

Maybe that's a little better in the uncontended (single-worker)
case, but it's worse at the high end.
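
(For reference, the CAS-for-TAS idea is to acquire the lock with a
compare-and-swap that fails immediately if the lock word is already set,
rather than an unconditional atomic exchange.  The snippet below is just a
sketch of that idea using the GCC builtins -- the function name is made up
for illustration, and it is not the actual patch.)

/*
 * Sketch only: TAS() implemented with compare-and-swap.  Returns 0 if we
 * got the lock, nonzero if it was already held, per TAS() convention.
 * A failed CAS doesn't write the lock word, which is the hoped-for win
 * under contention compared to an unconditional swap.
 */
static __inline__ int
tas_via_cas(volatile slock_t *lock)
{
    slock_t     expected = 0;

    return !__atomic_compare_exchange_n(lock, &expected, 1, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}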

I'm really curious to hear if this measurement method shows
any interesting improvements on your hardware.

            regards, tom lane

diff --git a/src/test/modules/test_shm_mq/setup.c b/src/test/modules/test_shm_mq/setup.c
index e05e97c6de..d535dbe408 100644
--- a/src/test/modules/test_shm_mq/setup.c
+++ b/src/test/modules/test_shm_mq/setup.c
@@ -135,6 +135,7 @@ setup_dynamic_shared_memory(int64 queue_size, int nworkers,
     hdr->workers_total = nworkers;
     hdr->workers_attached = 0;
     hdr->workers_ready = 0;
+    hdr->messages_xfred = 0;
     shm_toc_insert(toc, 0, hdr);

     /* Set up one message queue per worker, plus one. */
diff --git a/src/test/modules/test_shm_mq/test_shm_mq.h b/src/test/modules/test_shm_mq/test_shm_mq.h
index a666121834..ff35211811 100644
--- a/src/test/modules/test_shm_mq/test_shm_mq.h
+++ b/src/test/modules/test_shm_mq/test_shm_mq.h
@@ -32,6 +32,7 @@ typedef struct
     int            workers_total;
     int            workers_attached;
     int            workers_ready;
+    uint64        messages_xfred;
 } test_shm_mq_header;

 /* Set up dynamic shared memory and background workers for test run. */
diff --git a/src/test/modules/test_shm_mq/worker.c b/src/test/modules/test_shm_mq/worker.c
index 9b037b98fe..5aceb8ca73 100644
--- a/src/test/modules/test_shm_mq/worker.c
+++ b/src/test/modules/test_shm_mq/worker.c
@@ -31,7 +31,8 @@
 static void attach_to_queues(dsm_segment *seg, shm_toc *toc,
                              int myworkernumber, shm_mq_handle **inqhp,
                              shm_mq_handle **outqhp);
-static void copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh);
+static void copy_messages(volatile test_shm_mq_header *hdr,
+                          shm_mq_handle *inqh, shm_mq_handle *outqh);

 /*
  * Background worker entrypoint.
@@ -131,7 +132,7 @@ test_shm_mq_main(Datum main_arg)
     SetLatch(&registrant->procLatch);

     /* Do the work. */
-    copy_messages(inqh, outqh);
+    copy_messages(hdr, inqh, outqh);

     /*
      * We're done.  For cleanliness, explicitly detach from the shared memory
@@ -173,7 +174,8 @@ attach_to_queues(dsm_segment *seg, shm_toc *toc, int myworkernumber,
  * after this point is cleanup.
  */
 static void
-copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh)
+copy_messages(volatile test_shm_mq_header *hdr,
+              shm_mq_handle *inqh, shm_mq_handle *outqh)
 {
     Size        len;
     void       *data;
@@ -189,7 +191,12 @@ copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh)
         if (res != SHM_MQ_SUCCESS)
             break;

-        /* Send it back out. */
+        /* Create some contention on the mutex. */
+        SpinLockAcquire(&hdr->mutex);
+        ++hdr->messages_xfred;
+        SpinLockRelease(&hdr->mutex);
+
+        /* Send the message back out. */
         res = shm_mq_send(outqh, len, data, false, true);
         if (res != SHM_MQ_SUCCESS)
             break;
