Re: Optimize LISTEN/NOTIFY - Mailing list pgsql-hackers
From: Joel Jacobson
Subject: Re: Optimize LISTEN/NOTIFY
Msg-id: 02a7cd37-e2fc-4212-8b19-f8c239c95fb8@app.fastmail.com
In response to: Re: Optimize LISTEN/NOTIFY (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> The attached proof-of-concept patch proposes a straightforward
>> optimization for the single-listener case. It introduces a shared-memory
>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>> listener.
>
> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?

Good point. The previous patch would effectively force all LISTEN/UNLISTEN operations to be serialized, which would at least hurt parallelizability. New benchmarks confirm this hypothesis.

A new patch is attached that combines two complementary approaches, which together seem to scale well for both the common-channel and unique-channel scenarios:

1. Partitioned Hash Locking

The channel hash now uses HASH_PARTITION, with an array of NUM_NOTIFY_PARTITIONS lightweight locks. A given channel is mapped to a partition lock using a custom hash function on (dboid, channelname). This allows LISTEN/UNLISTEN operations on different channels to proceed concurrently without fighting over a single global lock, addressing the "many distinct channels" use-case.

2. Optimistic Read-Locking

For the "many backends on one channel" use-case, lock acquisition now follows a read-then-upgrade pattern. We first acquire a LW_SHARED lock to check the channel's state. If the channel is already marked as has_multiple_listeners, we can return immediately without any need for a write. Only if we are the first or second listener on a channel do we release the shared lock and acquire an LW_EXCLUSIVE lock to modify the hash entry. After getting the exclusive lock, we re-verify the state to guard against race conditions. This avoids serializing the third and all subsequent listeners for a popular channel. A rough sketch of this lock-acquisition pattern follows below.
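To make the read-then-upgrade pattern easier to follow, here is a minimal sketch of the idea. It is illustrative only: ChannelHashKey, ChannelHashEntry, channelHash, notifyPartitionLocks, NotifyChannelPartitionLock() and ListenChannelSketch() are made-up names, and the partition count is an arbitrary placeholder; only NUM_NOTIFY_PARTITIONS and has_multiple_listeners come from the description above, so this is not necessarily what the patch does.

/* Sketch only; assumes the usual async.c includes are available. */
#include "postgres.h"
#include "storage/lwlock.h"
#include "utils/hsearch.h"

#define NUM_NOTIFY_PARTITIONS 16        /* placeholder value */

typedef struct ChannelHashKey
{
    Oid         dboid;
    char        channel[NAMEDATALEN];
} ChannelHashKey;

typedef struct ChannelHashEntry
{
    ChannelHashKey key;                 /* hash key, must be first */
    ProcNumber  single_listener;        /* valid while there is only one listener */
    bool        has_multiple_listeners;
} ChannelHashEntry;

static HTAB *channelHash;               /* created with HASH_PARTITION elsewhere */
static LWLockPadded *notifyPartitionLocks;  /* array of NUM_NOTIFY_PARTITIONS locks */

/* Map a channel's hash code to its partition lock. */
static inline LWLock *
NotifyChannelPartitionLock(uint32 hashcode)
{
    return &notifyPartitionLocks[hashcode % NUM_NOTIFY_PARTITIONS].lock;
}

static void
ListenChannelSketch(Oid dboid, const char *channel)
{
    ChannelHashKey key;
    ChannelHashEntry *entry;
    uint32      hashcode;
    LWLock     *lock;
    bool        found;

    MemSet(&key, 0, sizeof(key));
    key.dboid = dboid;
    strlcpy(key.channel, channel, NAMEDATALEN);

    hashcode = get_hash_value(channelHash, &key);
    lock = NotifyChannelPartitionLock(hashcode);

    /* Fast path: shared lock is enough to inspect the entry. */
    LWLockAcquire(lock, LW_SHARED);
    entry = hash_search_with_hash_value(channelHash, &key, hashcode,
                                        HASH_FIND, NULL);
    if (entry && entry->has_multiple_listeners)
    {
        /* Third or later listener: nothing to write. */
        LWLockRelease(lock);
        return;
    }
    LWLockRelease(lock);

    /* Slow path: we may be the first or second listener, so upgrade. */
    LWLockAcquire(lock, LW_EXCLUSIVE);
    entry = hash_search_with_hash_value(channelHash, &key, hashcode,
                                        HASH_ENTER, &found);
    if (!found)
    {
        /* First listener on this channel: record us for direct signaling. */
        entry->single_listener = MyProcNumber;
        entry->has_multiple_listeners = false;
    }
    else if (!entry->has_multiple_listeners &&
             entry->single_listener != MyProcNumber)
    {
        /* Re-check under the exclusive lock: a second listener has arrived. */
        entry->has_multiple_listeners = true;
    }
    LWLockRelease(lock);
}

The point is that the exclusive partition lock is only ever taken by the first couple of listeners on a channel; everyone else stays on the shared lock.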
BENCHMARK

https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_overview_connections_equal_jobs.png
https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_overview_fixed_connections.png

I didn't want to attach the images to this email because they are quite large, due to all the detail in them. However, since it's important that this mailing list contains all relevant data discussed, I've also included all the data behind the graphs in ASCII/Markdown format: performance_overview.md

I've also included the raw parsed data from the pgbench output, which was used as input to create performance_overview.md as well as the images: pgbench_results_combined.csv

I've run each measurement five times, in random order. All raw measurements are included in the Markdown document within { curly braces }, sorted, next to the average values, to give an idea of the variance. Stddev felt potentially misleading, since I'm not sure benchmarking data like this is normally distributed.

I've run the benchmarks on my MacBook Pro (Apple M3 Max), using `caffeinate -dims pgbench ...`.

>> The patch also includes a "wake only tail" optimization (contributed by
>> Marko Tikkaja) to help prevent backends from falling too far behind.
>
> Coulda sworn we dealt with that case some years ago. In any case,
> if it's independent of the other idea it should probably get its
> own thread.

Maybe it's been dealt with by some other part of the system, but I can't find any such code anywhere; it's only async.c that currently sends PROCSIG_NOTIFY_INTERRUPT.

The wake-only-tail mechanism seems almost perfect, but I can think of at least one edge case where we could still get into a problem situation: with lots of idle backends, the rate of this one-by-one catch-up may not be fast enough to outpace the queue's advancement, causing other idle backends to eventually lag by more than the QUEUE_CLEANUP_DELAY threshold.

To ensure all backends are eventually processed without re-introducing the thundering-herd problem, an additional mechanism seems necessary. I see two main options:

1. Extend the chain reaction

Once woken, a backend could signal the next backend at the queue tail, propagating the catch-up process. This would need to be managed carefully, perhaps with some kind of global advisory lock, to prevent multiple cascades from running at once. (A rough sketch of how this could look follows after this list.)

2. Centralize the work

We already have the autovacuum daemon; maybe it could also be made responsible for kicking lagging backends?
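To make option 1 a little more concrete, here is roughly what the chain step could look like if it lived in async.c. This is only a sketch of the idea, not an implementation: FindFurthestBehindListener() is a made-up helper that would scan the per-backend queue positions and return the ProcNumber of the listener that is furthest behind, and the interlock against concurrent cascades is only hinted at in a comment.

/*
 * Sketch of option 1 (chain reaction), as if it lived in async.c.
 * FindFurthestBehindListener() is hypothetical; QUEUE_BACKEND_PID()
 * refers to async.c's existing per-backend bookkeeping.
 */
static void
ChainWakeNextLaggard(void)
{
    ProcNumber  laggard;
    pid_t       pid = InvalidPid;

    LWLockAcquire(NotifyQueueLock, LW_SHARED);
    laggard = FindFurthestBehindListener();
    if (laggard != INVALID_PROC_NUMBER)
        pid = QUEUE_BACKEND_PID(laggard);
    LWLockRelease(NotifyQueueLock);

    /*
     * The woken backend would call this again once it has advanced its own
     * position, so idle backends are caught up one at a time rather than
     * all at once.  Some interlock (e.g. a "cascade in progress" flag)
     * would be needed to prevent multiple concurrent cascades.
     */
    if (pid != InvalidPid)
        SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, laggard);
}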
Other ideas?

/Joel

Attached:

* pgbench-scripts.tar.gz
  pgbench scripts to reproduce the results, report, and images.

* performance_overview.md
  The same results as in the images, but in ASCII/Markdown format.

* pgbench_results_combined.csv
  Parsed output from the pgbench runs, used to create performance_overview.md as well as the linked images.

* 0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener-v2.patch
  The old patch, just renamed to -v2.

* 0002-Partition-channel-hash-to-improve-LISTEN-UNLISTEN-v2.patch
  New patch with the approach explained above.