
From Joel Jacobson
Subject Re: Optimize LISTEN/NOTIFY
Date
Msg-id 02a7cd37-e2fc-4212-8b19-f8c239c95fb8@app.fastmail.com
In response to Re: Optimize LISTEN/NOTIFY  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> The attached proof-of-concept patch proposes a straightforward
>> optimization for the single-listener case. It introduces a shared-memory
>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>> listener.
>
> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?

Good point. The previous patch would effectively force all LISTEN/UNLISTEN
to be serialized, which would at least hurt parallelizability.

New benchmarks confirm this hypothesis.

Attached is a new patch that combines two complementary approaches, which together
seem to scale well for both the common-channel and unique-channel scenarios:

1. Partitioned Hash Locking

The Channel Hash now uses HASH_PARTITION, with an array of NUM_NOTIFY_PARTITIONS
lightweight locks. A given channel is mapped to a partition lock using
a custom hash function on (dboid, channelname).

This allows LISTEN/UNLISTEN operations on different channels to proceed
concurrently without fighting over a single global lock, addressing the
"many distinct channels" use-case.

2. Optimistic Read-Locking

For the "many backends on one channel" use-case, lock acquisition now follows
a read-then-upgrade pattern. We first acquire an LW_SHARED lock to check the
channel's state. If the channel is already marked as has_multiple_listeners,
we can return immediately without any need for a write.

Only if we are the first or second listener on a channel do we release
the shared lock and acquire an LW_EXCLUSIVE lock to modify the hash entry.
After getting the exclusive lock, we re-verify the state to guard against
race conditions. This avoids serializing the third and all subsequent
listeners for a popular channel.
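
In code, the pattern looks roughly like this (continuing the sketch above;
ListenChannel() is an illustrative name, and error handling and the UNLISTEN
side are omitted):

static void
ListenChannel(const NotifyChannelKey *key, ProcNumber me)
{
    uint32      hashvalue = get_hash_value(ChannelHash, key);
    LWLock     *lock = NotifyPartitionLock(hashvalue);
    NotifyChannelEntry *entry;
    bool        found;

    /* Optimistic probe under a shared lock. */
    LWLockAcquire(lock, LW_SHARED);
    entry = (NotifyChannelEntry *)
        hash_search_with_hash_value(ChannelHash, key, hashvalue,
                                    HASH_FIND, &found);
    if (found && entry->has_multiple_listeners)
    {
        /* Fast path: the third and later listeners never need to write. */
        LWLockRelease(lock);
        return;
    }
    LWLockRelease(lock);

    /*
     * Slow path: retake the lock exclusively and re-check, since another
     * backend may have changed the entry while we held no lock.
     */
    LWLockAcquire(lock, LW_EXCLUSIVE);
    entry = (NotifyChannelEntry *)
        hash_search_with_hash_value(ChannelHash, key, hashvalue,
                                    HASH_ENTER, &found);
    if (!found)
    {
        /* We are the first listener on this channel. */
        entry->listener = me;
        entry->has_multiple_listeners = false;
    }
    else if (!entry->has_multiple_listeners && entry->listener != me)
    {
        /* We are the second listener; flip to the multi-listener state. */
        entry->has_multiple_listeners = true;
    }
    LWLockRelease(lock);
}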

BENCHMARK


https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_overview_connections_equal_jobs.png


https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_overview_fixed_connections.png

I didn't want to attach the images to this email because they are quite large,
due to all the detail in them.

However, since it's important that this mailing list contains all the relevant
data discussed, I've also included the data from the graphs in ASCII/Markdown format:

performance_overview.md

I've also included the raw parsed data from the pgbench output,
which has been used as input to create performance_overview.md
as well as the images:

pgbench_results_combined.csv

I've run each measurement five times, in random order.
All raw measurements are included in the Markdown document
within { curly braces }, sorted, next to the average values, to give an idea
of the variance. Stddev felt potentially misleading, since I'm not sure the
data points are normally distributed, given that it's benchmarking data.

I've run the benchmarks on my MacBook Pro Apple M3 Max,
using `caffeinate -dims pgbench ...`.

>> The patch also includes a "wake only tail" optimization (contributed by
>> Marko Tikkaja) to help prevent backends from falling too far behind.
>
> Coulda sworn we dealt with that case some years ago.  In any case,
> if it's independent of the other idea it should probably get its
> own thread.

Maybe it's been dealt with by some other part of the system, but I can't
find any such code anywhere; async.c is the only place that currently sends
PROCSIG_NOTIFY_INTERRUPT.

The wake-only-tail mechanism seems almost perfect, but I can think of at least
one edge case where we could still run into a problem:

With lots of idle backends, this one-by-one catch-up may not be fast
enough to keep pace with the queue's advancement, causing other idle backends
to eventually lag by more than the QUEUE_CLEANUP_DELAY threshold.

To ensure all backends are eventually processed without re-introducing
the thundering herd problem, an additional mechanism seems necessary.

I see two main options:

1. Extend the chain reaction
Once woken, a backend could signal the next backend at the queue tail,
propagating the catch-up process (see the rough sketch below). This would
need to be managed carefully, perhaps with some kind of global advisory
lock, to prevent multiple cascades from running at once.

2. Centralize the work
We already have the autovacuum daemon, maybe it could also be made responsible
for kicking lagging backends?
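
To make option 1 a bit more concrete, here's a very rough sketch of what could
run at the end of ProcessIncomingNotify(), after a woken backend has caught up.
FindFurthestBehindListener() is a hypothetical helper that would scan the
per-backend queue positions and return the backend furthest behind (or
INVALID_PROC_NUMBER if nobody lags); the other names are essentially the ones
already used in async.c:

static void
MaybeWakeNextLaggard(void)
{
    ProcNumber  laggard;
    pid_t       pid = 0;

    LWLockAcquire(NotifyQueueLock, LW_SHARED);
    laggard = FindFurthestBehindListener();     /* hypothetical helper */
    if (laggard != INVALID_PROC_NUMBER && laggard != MyProcNumber)
        pid = QUEUE_BACKEND_PID(laggard);
    LWLockRelease(NotifyQueueLock);

    /* Signal outside the lock, same as SignalBackends() does. */
    if (pid != 0)
        SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, laggard);
}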

Other ideas?

/Joel

Attached:

* pgbench-scripts.tar.gz
pgbench scripts to reproduce the results, report and images.

* performance_overview.md
Same results as in the images, but in ASCII/Markdown format.

* pgbench_results_combined.csv
Parsed output from pgbench runs, used to create performance_overview.md as well as the linked images.

* 0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener-v2.patch
Old patch just renamed to -v2

* 0002-Partition-channel-hash-to-improve-LISTEN-UNLISTEN-v2.patch
New patch with the approach explained above.
