Re: Add spin_delay() implementation for Arm in s_lock.h - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Add spin_delay() implementation for Arm in s_lock.h |
Date | |
Msg-id | 1269414.1641521533@sss.pgh.pa.us |
In response to | Re: Add spin_delay() implementation for Arm in s_lock.h ("Blake, Geoff" <blakgeof@amazon.com>) |
Responses | Re: Add spin_delay() implementation for Arm in s_lock.h |
List | pgsql-hackers |
"Blake, Geoff" <blakgeof@amazon.com> writes: > Hope everything is well going into the new year. I'd like to pick this discussion back up and your thoughts on the patchwith the data I posted 2 weeks prior. Is there more data that would be helpful? Different setup? Data on older versionsof Postgresql to ascertain if it makes more sense on versions before the large re-work of the snapshot algorithmthat exhibited quite a bit of synchronization contention? I spent some time working on this. I don't have a lot of faith in pgbench as a measurement testbed for spinlock contention, because over the years we've done a good job of getting rid of that in our main code paths (both the specific change you mention, and many others). After casting around a bit and thinking about writing a bespoke test framework, I landed on the idea of adding some intentional spinlock contention to src/test/modules/test_shm_mq, which is a prefab test framework for passing data among multiple worker processes. The attached quick-hack patch makes it grab and release a spinlock once per passed message. I'd initially expected that this would show only marginal changes, because you'd hope that a spinlock acquisition would be reasonably cheap compared to shm_mq_receive plus shm_mq_send. Turns out not though. The proposed test case is (1) patch test_shm_mq as below (2) time this query: SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, n); for various values of "n" up to about how many cores you have. (You'll probably need to bump up max_worker_processes.) For context, on my Intel-based main workstation (8-core Xeon W-2245), the time to do this with stock test_shm_mq is fairly level. Reporting best-of-3 runs: regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1); Time: 1386.413 ms (00:01.386) regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4); Time: 1302.503 ms (00:01.303) regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8); Time: 1373.121 ms (00:01.373) However, after applying the contention patch: regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1); Time: 1346.362 ms (00:01.346) regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4); Time: 3313.490 ms (00:03.313) regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8); Time: 7660.329 ms (00:07.660) So this seems like (a) it's a plausible model for code that has unoptimized spinlock contention, and (b) the effects are large enough that you needn't fret too much about measurement noise. I tried this out on a handy Apple M1 mini, which I concede is not big iron but it's pretty decent aarch64 hardware. With current HEAD's spinlock code, I get (again best-of-3): regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1); Time: 1630.255 ms (00:01.630) regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4); Time: 3495.066 ms (00:03.495) regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8); Time: 19541.929 ms (00:19.542) With your spin-delay patch: regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1); Time: 1643.524 ms (00:01.644) regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4); Time: 3404.625 ms (00:03.405) regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8); Time: 19260.721 ms (00:19.261) So I don't see a lot of reason to think your patch changes anything. Maybe on something with more cores? 
For grins I also tried this same test with the use-CAS-for-TAS patch
that was being discussed November before last, and it didn't really
show up as any improvement either:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1608.642 ms (00:01.609)
regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3396.564 ms (00:03.397)
regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 20092.683 ms (00:20.093)

Maybe that's a little better in the uncontended (single-worker) case,
but it's worse at the high end. (A sketch of the CAS-for-TAS idea
follows the patch below.)

I'm really curious to hear if this measurement method shows any
interesting improvements on your hardware.

			regards, tom lane

diff --git a/src/test/modules/test_shm_mq/setup.c b/src/test/modules/test_shm_mq/setup.c
index e05e97c6de..d535dbe408 100644
--- a/src/test/modules/test_shm_mq/setup.c
+++ b/src/test/modules/test_shm_mq/setup.c
@@ -135,6 +135,7 @@ setup_dynamic_shared_memory(int64 queue_size, int nworkers,
 	hdr->workers_total = nworkers;
 	hdr->workers_attached = 0;
 	hdr->workers_ready = 0;
+	hdr->messages_xfred = 0;
 	shm_toc_insert(toc, 0, hdr);
 
 	/* Set up one message queue per worker, plus one. */
diff --git a/src/test/modules/test_shm_mq/test_shm_mq.h b/src/test/modules/test_shm_mq/test_shm_mq.h
index a666121834..ff35211811 100644
--- a/src/test/modules/test_shm_mq/test_shm_mq.h
+++ b/src/test/modules/test_shm_mq/test_shm_mq.h
@@ -32,6 +32,7 @@ typedef struct
 	int			workers_total;
 	int			workers_attached;
 	int			workers_ready;
+	uint64		messages_xfred;
 } test_shm_mq_header;
 
 /* Set up dynamic shared memory and background workers for test run. */
diff --git a/src/test/modules/test_shm_mq/worker.c b/src/test/modules/test_shm_mq/worker.c
index 9b037b98fe..5aceb8ca73 100644
--- a/src/test/modules/test_shm_mq/worker.c
+++ b/src/test/modules/test_shm_mq/worker.c
@@ -31,7 +31,8 @@
 static void attach_to_queues(dsm_segment *seg, shm_toc *toc,
 							 int myworkernumber, shm_mq_handle **inqhp,
 							 shm_mq_handle **outqhp);
-static void copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh);
+static void copy_messages(volatile test_shm_mq_header *hdr,
+						  shm_mq_handle *inqh, shm_mq_handle *outqh);
 
 /*
  * Background worker entrypoint.
@@ -131,7 +132,7 @@ test_shm_mq_main(Datum main_arg)
 	SetLatch(&registrant->procLatch);
 
 	/* Do the work. */
-	copy_messages(inqh, outqh);
+	copy_messages(hdr, inqh, outqh);
 
 	/*
 	 * We're done.  For cleanliness, explicitly detach from the shared memory
@@ -173,7 +174,8 @@ attach_to_queues(dsm_segment *seg, shm_toc *toc, int myworkernumber,
  * after this point is cleanup.
  */
 static void
-copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh)
+copy_messages(volatile test_shm_mq_header *hdr,
+			  shm_mq_handle *inqh, shm_mq_handle *outqh)
 {
 	Size		len;
 	void	   *data;
@@ -189,7 +191,12 @@ copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh)
 		if (res != SHM_MQ_SUCCESS)
 			break;
 
-		/* Send it back out. */
+		/* Create some contention on the mutex. */
+		SpinLockAcquire(&hdr->mutex);
+		++hdr->messages_xfred;
+		SpinLockRelease(&hdr->mutex);
+
+		/* Send the message back out. */
 		res = shm_mq_send(outqh, len, data, false, true);
 		if (res != SHM_MQ_SUCCESS)
 			break;
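[The use-CAS-for-TAS patch mentioned above is likewise not attached to
this message. As a hedged illustration of the general idea only, where
the function name and the use of GCC's __atomic builtins are assumptions
for the sketch rather than details taken from that patch, acquiring a
TAS-style spinlock with compare-and-swap instead of an unconditional
exchange could be written as:

/*
 * Illustrative sketch only, not the actual use-CAS-for-TAS patch.
 * A classic TAS does an unconditional atomic exchange; this variant
 * attempts the write with compare-and-swap, so it only stores when
 * the lock word is observed to be 0 (free).
 */
typedef int slock_t;

static __inline__ int
tas_via_cas(volatile slock_t *lock)
{
	slock_t		expected = 0;

	/* Return 0 if we acquired the lock, nonzero if it was already held */
	return !__atomic_compare_exchange_n(lock, &expected, 1, false,
										__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

The possible appeal is that a failed CAS need not dirty the cache line
the way an unconditional exchange does, which might reduce ping-ponging
among waiters; the numbers above suggest it makes little difference on
this hardware.]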