Re: Optimize LISTEN/NOTIFY - Mailing list pgsql-hackers

From Chao Li
Subject Re: Optimize LISTEN/NOTIFY
Date
Msg-id 1F7227F5-C33D-4E2C-8511-33F1468590D0@gmail.com
In response to Re: Optimize LISTEN/NOTIFY  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Optimize LISTEN/NOTIFY
List pgsql-hackers


On Oct 15, 2025, at 05:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Joel Jacobson" <joel@compiler.org> writes:
Having investigated this, the "direct advancement" approach seems
correct to me.

(I understand the exclusive lock in PreCommit_Notify on NotifyQueueLock
is of course needed because there are other operations that don't
acquire the heavyweight-lock, that take shared/exclusive lock on
NotifyQueueLock to read/modify QUEUE_HEAD, so the exclusive lock on
NotifyQueueLock in PreCommit_Notify is needed, since it modifies the
QUEUE_HEAD.)

Right.  What the heavyweight lock buys for us in this context is that
we can be sure no other would-be notifier can insert any messages
in between ours, even though we may take and release NotifyQueueLock
several times to allow readers to sneak in.  That in turn means that
it's safe to advance readers over that whole set of messages if we
know we didn't wake them up for any of those messages.

There is a false-positive possibility if a reader was previously
signaled but hasn't yet awoken: we will think that maybe we signaled
it and hence not advance its pointer.  This is an error in the safe
direction however, and it will advance its pointer when it does
wake up.

A potential complaint is that we are doubling down on the need for
that heavyweight lock, despite the upthread discussion about maybe
getting rid of it for better scalability.  However, this patch
only requires holding a lock across all the insertions, not holding
it through commit which I think is the true scalability blockage.
If we did want to get rid of that lock, we'd only need to stop
releasing NotifyQueueLock at insertion page boundary crossings,
which I suspect isn't really that useful anyway.  (In connection
with that though, I think you ought to capture both the "before" and
"after" pointers within that lock interval, not expend another lock
acquisition later.)

It would be good if the patch's comments made these points ...
also, the comments above struct AsyncQueueControl need to be
updated, because changing some other backend's queue pos is
not legal under any of the stated rules.


I used to think “direct advancement” was a good idea. After reading Tom’s explanation and re-reading v16 carefully, I now also think it adds complexity and could be fragile.

I have put together an example of a race condition; please check whether it is valid.

Because recording queueHeadBeforeWrite and queueHeadAfterWrite happens in PreCommit_Notify() while checking them happens in AtCommit_Notify(), there is an interval in between during which something can happen.

Say listener A has its head pointing to 1.

The current QUEUE_HEAD is also 1.

Now two notifiers B and C are committing:
 * B enters PreCommit_Notify(), acquires NotifyQueueLock first, records queueHeadBeforeWrite = 1, writes up to 3, and records queueHeadAfterWrite = 3.
 * QUEUE_HEAD is now 3.
 * C enters PreCommit_Notify(), records queueHeadBeforeWrite = 3, writes up to 5, and records queueHeadAfterWrite = 5.
 * QUEUE_HEAD is now 5.
 * C runs AtCommit_Notify() first. A’s head is 1, which does not equal C’s queueHeadBeforeWrite (3), so C does not advance A’s head.
 * B then runs AtCommit_Notify(). A’s head equals B’s queueHeadBeforeWrite (1), so B advances A’s head to 3.
 * At this point QUEUE_HEAD is 5 but A’s head is 3, so “direct advancement” will never catch A up until the next time A wakes up (see the sketch below).
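
To make the failure concrete, here is a minimal sketch of the kind of direct-advancement check I understand the patch to be doing in AtCommit_Notify(). Only queueHeadBeforeWrite/queueHeadAfterWrite come from the discussion above; the loop shape and the rest are illustrative, not the actual v16 code:

```
/*
 * Illustrative sketch only, not the actual patch code: each committing
 * notifier remembers where QUEUE_HEAD was before and after its own
 * writes, and advances a listener only if the listener is still parked
 * exactly at queueHeadBeforeWrite.  (The real patch presumably also
 * skips any listener it has signaled.)
 */
for (int i = QUEUE_FIRST_LISTENER; i > 0; i = QUEUE_NEXT_LISTENER(i))
{
	if (QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(i), queueHeadBeforeWrite))
		QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
}

/*
 * In the scenario above, C commits first and skips A (A is at 1, C's
 * queueHeadBeforeWrite is 3); B commits later and moves A only to 3,
 * leaving A stranded behind QUEUE_HEAD = 5.
 */
```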

I am brainstorming here: maybe we can use a simpler strategy, namely waking a backend up whenever its queue lag exceeds a threshold. This approach is simpler and more reliable, and it also reduces the total wake-up count.
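
A minimal sketch of that idea, reusing the page-distance helper that async.c already has; the threshold name, value, and helper function are invented here for illustration:

```
/*
 * Hypothetical sketch: wake up a listener whenever its read position has
 * fallen more than some number of queue pages behind QUEUE_HEAD, instead
 * of tracking per-notifier before/after positions.  asyncQueuePageDiff(),
 * QUEUE_POS_PAGE, QUEUE_HEAD and QUEUE_BACKEND_POS() are existing pieces
 * of async.c; NOTIFY_LAG_WAKEUP_PAGES and backend_is_lagging() are made
 * up.  Caller must hold NotifyQueueLock (at least in shared mode).
 */
#define NOTIFY_LAG_WAKEUP_PAGES		4	/* illustrative value */

static bool
backend_is_lagging(int i)
{
	return asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
							  QUEUE_POS_PAGE(QUEUE_BACKEND_POS(i))) >
		NOTIFY_LAG_WAKEUP_PAGES;
}
```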


Given all the experiments since my earlier message, here is a fresh,
self-contained write-up:

I'm getting itchy about removing the local listenChannels list,
because what you've done is to replace it with a shared data
structure that can't be accessed without a good deal of locking
overhead.  That seems like it could easily be a net loss.

Also, I really do not like this implementation of
GetPendingNotifyChannels, as it looks like O(N^2) effort.
The potentially large length of the list it builds is scary too,
considering the comments that SignalBackends had better not fail.
If we have to do it that way it'd be better to collect the list
during PreCommit_Notify.


I agree with Tom that GetPendingNotifyChannels() is too heavy and unnecessary.

In PreCommit_Notify(), we can maintain a local hash table to record the channel names of pending notifications; dynahash also supports hash tables in local memory.

Then, in SignalBackends(), we no longer need GetPendingNotifyChannels(); we can simply iterate over all keys of the local channel-name hash.

And the local static numChannelsListeningOn is also not needed; we can get the count from the local hash.
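
A minimal sketch of what I have in mind, keyed by channel name and kept in CurTransactionContext so it goes away automatically at end of transaction. The dynahash API calls are real; all the other names (PendingChannelEntry, RecordPendingChannel, the table name) are hypothetical:

```
#include "postgres.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"

/* Hypothetical per-transaction hash of channels with pending notifications */
typedef struct PendingChannelEntry
{
	char		channel[NAMEDATALEN];	/* hash key */
} PendingChannelEntry;

static HTAB *pendingChannelHash = NULL;

/* Called from PreCommit_Notify() for every queued notification */
static void
RecordPendingChannel(const char *channel)
{
	if (pendingChannelHash == NULL)
	{
		HASHCTL		ctl;

		ctl.keysize = NAMEDATALEN;
		ctl.entrysize = sizeof(PendingChannelEntry);
		ctl.hcxt = CurTransactionContext;
		pendingChannelHash = hash_create("pending notify channels", 16, &ctl,
										 HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
	}
	(void) hash_search(pendingChannelHash, channel, HASH_ENTER, NULL);
}
```

SignalBackends() could then walk the table with hash_seq_init()/hash_seq_search(), and hash_get_num_entries() gives the channel count if it is still needed anywhere.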

WRT v6, I have a few new comments:

1 - 0002
```
  *  After commit we are called another time (AtCommit_Notify()). Here we
- *  make any actual updates to the effective listen state (listenChannels).
+ *  make any actual updates to the effective listen state (channelHash).
  *  Then we signal any backends that may be interested in our messages
  *  (including our own backend, if listening).  This is done by
- *  SignalBackends(), which scans the list of listening backends and sends a
- *  PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- *  know which backend is listening on which channel so we must signal them
- *  all).  We can exclude backends that are already up to date, though, and
- *  we can also exclude backends that are in other databases (unless they
- *  are way behind and should be kicked to make them advance their
- *  pointers).
+ *  SignalBackends(), which consults the shared channel hash table to
+ *  identify listeners for the channels that have pending notifications
+ *  in the current database.  Each selected backend is marked as having a
+ *  wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ *  signal is sent to it.
```

In this comment you refer to both “channelHash” and “the shared channel hash table”. They are the same thing, but using two different names can easily mislead readers.

2 - 0002
```
 pg_listening_channels(PG_FUNCTION_ARGS)
 {
  FuncCallContext *funcctx;
+ List   *listenChannels;
 
  /* stuff done only on the first call of the function */
  if (SRF_IS_FIRSTCALL())
  {
+ MemoryContext oldcontext;
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
  /* create a function context for cross-call persistence */
  funcctx = SRF_FIRSTCALL_INIT();
```

listenChannels is only used within the “if” block, so its definition can be moved inside it.
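
I.e. something like this, relative to the hunk above:

```
  pg_listening_channels(PG_FUNCTION_ARGS)
  {
   FuncCallContext *funcctx;
- List   *listenChannels;
 
   /* stuff done only on the first call of the function */
   if (SRF_IS_FIRSTCALL())
   {
+ List   *listenChannels;
   MemoryContext oldcontext;
   dshash_seq_status status;
   ChannelEntry *entry;
```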

3 - 0002
```
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+  QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
```

I wonder why this check is needed. If queue_length is 0, can we return immediately from SignalBackends()?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/



