Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue - Mailing list pgsql-hackers
| From | Heikki Linnakangas |
|---|---|
| Subject | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue |
| Date | |
| Msg-id | 66213fee-00ff-4952-802d-c06454e521ac@iki.fi Whole thread Raw |
| In response to | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue ("Joel Jacobson" <joel@compiler.org>) |
| Responses |
Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue |
| List | pgsql-hackers |
On 03/11/2025 23:45, Joel Jacobson wrote: > On Mon, Nov 3, 2025, at 12:02, Heikki Linnakangas wrote: >> I wrote another little stand-alone performance test program for this, >> attached. It launches N connections that send NOTIFYs to a single >> channel as fast as possible, and M threads that listen for the >> notifications. I ran it with different combinations of N and M, on >> 'master' and on REL_14_STABLE (which didn't have SLRU banks) and I >> cannot discern any performance difference from these patches. So it >> seems that holding the SLRU (bank) lock across the >> TransactionIdDidCommit() calls is fine. > > Nice! That for the benchmark code! I took the liberty of hacking a bit > on it, and added support for multiple channels, with separate listener > and notifier threads per channel. Each notification now carries the > notifier ID, a sequence number, and a send timestamp. Listeners verify > that sequence numbers arrive in order and record delivery latency. The > program collects latency measurements into fixed buckets and reports > them once per second together with total and per-second send/receive > counts. > > Also added a short delay before starting notifiers so that listeners > have time to issue their LISTEN commands, and a new --channels option, > and the meaning of --listeners and --notifiers was changed to apply per > channel. > > Also fixed so the code could be compiled outside of the PostgreSQL > source code repo, if wanting to build this as stand-alone tool. > > I've benchmarked master vs 0001+0002 and can't notice any differences; > see attached output from benchmark runs. Thanks. After some further testing, I was able to find a scenario where this patch significantly reduces performance: if the listening backends subscribe to a massive number of channels, like 10000, they spend a lot of time scanning the linked list of subscribed channels in IsListeningOn(). With the patch, those checks were performed while holding the SLRU lock, and it started to show up as lock contention between notifiers and listeners. To demonstrate that, attached is another version of the test program that adds an --extra-channels=N argument. If you set it to e.g. 10000, each listener backends calls LISTEN on 10000 additional channels that are never notified. They just make the listenChannels list longer. With that and the patches I posted previously, I'm getting: $ PGHOST=localhost PGDB=postgres://localhost/postgres ./async-notify-test-3 --listeners=50 --notifiers=4 --channels=1 --extra-channels=10000 10 s: 12716 sent (1274/s), 635775 received (63737/s) 0.00-0.01ms 0 (0.0%) avg: 0.000ms 0.01-0.10ms 0 (0.0%) avg: 0.000ms 0.10-1.00ms # 1915 (0.3%) avg: 0.807ms 1.00-10.00ms ######### 633550 (99.7%) avg: 3.502ms 10.00-100.00ms # 310 (0.0%) avg: 11.423ms >100.00ms 0 (0.0%) avg: 0.000ms ^C Whereas on 'master', I see about 2-3x more notifies/s: $ PGHOST=localhost PGDB=postgres://localhost/postgres ./async-notify-test-3 --listeners=50 --notifiers=4 --channels=1 --extra-channels=10000 10 s: 32057 sent (3296/s), 1602995 received (164896/s) 0.00-0.01ms 0 (0.0%) avg: 0.000ms 0.01-0.10ms # 11574 (0.7%) avg: 0.078ms 0.10-1.00ms ###### 1082960 (67.6%) avg: 0.577ms 1.00-10.00ms ### 508199 (31.7%) avg: 1.489ms 10.00-100.00ms # 262 (0.0%) avg: 16.178ms >100.00ms 0 (0.0%) avg: 0.000ms ^C Fortunately that's easy to fix: We can move the IsListeningOn() check after releasing the lock. See attached. The elephant in the room of course is that a lookup in a linked list is O(n) and it would be very straightforward to replace it with e.g. a hash table. We should do that irrespective of this bug fix. But I'm inclined to do it as a separate followup patch. - Heikki
Attachment
pgsql-hackers by date: