Re: Optimize LISTEN/NOTIFY - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: Optimize LISTEN/NOTIFY
Date
Msg-id dab234b5-a10c-4fb5-a2b1-ce725d3e3020@app.fastmail.com
In response to Re: Optimize LISTEN/NOTIFY  ("Joel Jacobson" <joel@compiler.org>)
List pgsql-hackers
On Fri, Nov 14, 2025, at 17:01, Joel Jacobson wrote:
> On Thu, Nov 13, 2025, at 08:13, Joel Jacobson wrote:
>> Attached, please find a new version rebased on top of the bug fix
>> patches that just got committed in 0bdc777, 797e9ea, 8eeb4a0, and
>> 1b46990.
>
> To help reviewers, here is a new write-up of the patch:
> [...write-up...]

While reviewing all the comments in async.c to make sure they match the
patch's code, I noticed a few discrepancies. One of them was the comment
above QUEUE_CLEANUP_DELAY, which again made me think about how master
currently uses that value as the threshold for when to "wake laggers".

In this patch, QUEUE_CLEANUP_DELAY is no longer used for that
purpose; it now only determines how often we try to advance the tail
pointer.
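To illustrate the idea, the kind of throttle that implies could look
roughly like the sketch below. This is an illustrative sketch only, not
the patch's actual code; it reuses asyncQueuePageDiff() and
asyncQueueAdvanceTail() from async.c to show the gist: tail advancement
is only attempted once the tail has fallen at least QUEUE_CLEANUP_DELAY
pages behind the head.

```
/*
 * Illustrative sketch, not the patch's actual code: only try to advance
 * the queue tail once it has fallen at least QUEUE_CLEANUP_DELAY pages
 * behind the queue head.
 */
if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
                       QUEUE_POS_PAGE(QUEUE_TAIL)) >= QUEUE_CLEANUP_DELAY)
    asyncQueueAdvanceTail();
```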

I realize someone who is familiar with the current code in master might
ask the following question:

    Why not keep the direct advancement, but just use the old "wake
    laggers" logic for listeners that lag behind by more than
    QUEUE_CLEANUP_DELAY pages?

On the surface this might look like a plausible alternative,
since that's what master currently does (though only for backends in
other databases).

I was curious to see how such an alternative approach would affect
performance, so I changed SignalBackends and ran some benchmarks.

Below is the v27 logic:

```
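/*
 * Does this listener still have entries to read?  Either it is already
 * advancing but its advancing position is still before the post-write
 * head, or it is stationary at a position behind the pre-write head.
 * If so, signal it.
 */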
if (QUEUE_BACKEND_IS_ADVANCING(i) ?
    QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite) :
    QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
{
    Assert(pid != InvalidPid);
    QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
    pids[count] = pid;
    procnos[count] = i;
    count++;
}
else if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
          QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
{
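    /*
     * Stationary listener that was already caught up to the pre-write
     * head: advance its position directly past the entries we just
     * wrote; no signal needed.
     */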
    Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
    QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
}
```

And here is the modified "wake laggers" version I tested:

```
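/* Direct advancement, under the same conditions as in v27. */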
if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
    !QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite) &&
     QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
{
    QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
}
else if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
                            QUEUE_POS_PAGE(pos)) >= QUEUE_CLEANUP_DELAY)
{
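    /*
     * Only signal listeners that have fallen at least
     * QUEUE_CLEANUP_DELAY pages behind the queue head, similar to what
     * master does for listeners in other databases.
     */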
    QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
    pids[count] = pid;
    procnos[count] = i;
    count++;
}
```

This preserves direct advancement under the same conditions as the patch,
but only sends signals to backends that are "laggers" by
the QUEUE_CLEANUP_DELAY threshold, similar to master's behavior.

It turns out the "wake laggers" approach is significantly slower:

"wake laggers":

```
./pg_async_notify_test --listeners 1 --notifiers 1 --channels 1000 --sleep 0.01 --sleep-exp 1.01
20 s: 80871 sent (4650/s), 80871 received (4650/s)
Notification Latency Distribution:
 0.00-0.01ms                0 (0.0%) avg: 0.000ms
 0.01-0.10ms     #          380 (0.5%) avg: 0.085ms
 0.10-1.00ms     #          2765 (3.4%) avg: 0.289ms
 1.00-10.00ms    #          4947 (6.1%) avg: 5.870ms
 10.00-100.00ms  #######    57467 (71.1%) avg: 52.101ms
 >100.00ms       #          15312 (18.9%) avg: 119.847ms
```

v27:

```
% ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 1000 --sleep 0.01 --sleep-exp 1.01
20 s: 229866 sent (13985/s), 229865 received (13984/s)
Notification Latency Distribution:
 0.00-0.01ms                0 (0.0%) avg: 0.000ms
 0.01-0.10ms     #          18351 (8.0%) avg: 0.072ms
 0.10-1.00ms     #          32899 (14.3%) avg: 0.315ms
 1.00-10.00ms    ####       103273 (44.9%) avg: 5.055ms
 10.00-100.00ms  ###        75342 (32.8%) avg: 18.197ms
 >100.00ms                  0 (0.0%) avg: 0.000ms
```

I reran the experiments three times with similar results. I also tested
other test permutations, which likewise showed that "wake laggers" was
worse.

At first glance, both approaches ultimately signal all backends
interested in our notifies, so it may seem surprising that latency
differs this much. The key point is what happens to non-interested
backends:

A backend that is *not* listening to our channels may have stopped at a
position before the old queue head, because it woke up and read up to
the then-current head before all notifies for a previous commit had been
written. Such a backend might actually be interested in the notifies
that lie between its current position and the old queue head, and it
could therefore be urgent to wake it up so that it processes the queue
and delivers those notifies.

In v27, such a backend gets signaled on the next NOTIFY, when the
notifying backend notices that it is stationary at a position behind
queueHeadBeforeWrite.

With "wake laggers", however, it receives no signal until it becomes a
"lagger" by QUEUE_CLEANUP_DELAY pages, which can be far later. This
risks delaying its processing and the delivery of notifications it is
interested in.

In master this is fine because "wake laggers" is only applied to
backends in other databases that we know are not interested in our
notifications.
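For comparison, the shape of that distinction in master's SignalBackends
is roughly the following (a paraphrased sketch, not an exact quote of
master's code):

```
/* Per-listener check in master's SignalBackends (paraphrased) */
if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
{
    /* Our own database: signal unless the listener is already caught up. */
    if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
        continue;
}
else
{
    /* Other databases: only wake listeners that have fallen far behind. */
    if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
                           QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
        continue;
}
```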

In conclusion, the "wake laggers" mechanism seems inherently
incompatible with the direct advancement mechanism, and I therefore
think the current approach in v27 is sound.

I just wanted to share this reasoning, not because anyone has raised
concerns about it, but because I hadn't fully internalized it myself and
thought it might be helpful to others.

The attached v28 is the same as v27, except some comments have been
fixed to accurately reflect the code.

/Joel
