Thread: LISTEN / NOTIFY performance in 8.3
Hi, I'm noticing a strange increase in the amount of time it takes to issue a NOTIFY statement. I have an existing app that provides a producer / consumer type of queue and that uses the LISTEN / NOTIFY mechanism to signal the consumers of new items arriving in the queue. The consumers then process these items and issue a notify to signal that they have been processed. In the past issuing these notifications happened very quickly, now on 8.3 I'm seeing all of them taking over 300ms and many of them taking 1500ms or more! The notifications are happening outside of any transactions (which is itself a probable area for performance improvement, I realize) but I'm wondering what might have changed between 8.1 (the version I was using in the past) and 8.3? TIA, Joel
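A minimal sketch of the signalling pattern being described, for readers who haven't used it (the queue table and its column are invented for illustration; the to_consumer / to_producer channel names are the ones the test script uses later in the thread, and on 8.3 a NOTIFY carries only a condition name, no payload):

-- Producer: enqueue an item, then wake any listening consumers.
INSERT INTO queue_items (payload) VALUES ('new work');  -- hypothetical table
NOTIFY to_consumer;

-- Each consumer: register interest once, then wait for notifications.
LISTEN to_consumer;
-- ...fetch and process the item, then tell the producer it's done...
NOTIFY to_producer;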
Joel Stevenson <joelstevenson@mac.com> writes:
> I have an existing app that provides a producer / consumer type of
> queue and that uses the LISTEN / NOTIFY mechanism to signal the
> consumers of new items arriving in the queue. The consumers then
> process these items and issue a notify to signal that they have been
> processed. In the past issuing these notifications happened very
> quickly, now on 8.3 I'm seeing all of them taking over 300ms and many
> of them taking 1500ms or more!

That's strange, I would not have thought that listen/notify behavior
would change at all.  How are you measuring this delay exactly?
Can you put together a self-contained test case?

			regards, tom lane
At 11:58 PM -0500 2/23/08, Tom Lane wrote:
>Joel Stevenson <joelstevenson@mac.com> writes:
>>> That's strange, I would not have thought that listen/notify behavior
>>> would change at all.  How are you measuring this delay exactly?
>>> Can you put together a self-contained test case?
>
>> Attached is a perl script that sort of simulates what's going on.
>
>Thanks for the script.  It's not showing any particular problems here,
>though.  With log_min_duration_statement = 10, the only statements that
>(slightly) exceed 10ms are the select count(*) from generate_series(1,
>15000) ones.
>
>> Also of note, the iowait percentages on this quad core linux box jump
>> to 30-40% while this test script is running, even though there's no
>> table activity involved and the producer consumers pause for up to a
>> second between iterations.
>
>This sounds a bit like pg_listener has gotten bloated.  Try a "VACUUM
>VERBOSE pg_listener" (as superuser) and see what it says.

At the moment (server is inactive):

pcdb=# VACUUM VERBOSE pg_listener;
INFO:  vacuuming "pg_catalog.pg_listener"
INFO:  "pg_listener": removed 1 row versions in 1 pages
INFO:  "pg_listener": found 1 removable, 21 nonremovable row versions in 28 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 2319 unused item pointers.
28 pages contain useful free space.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

Running the test script and then the above command:

pcdb=# VACUUM VERBOSE pg_listener;
INFO:  vacuuming "pg_catalog.pg_listener"
INFO:  "pg_listener": removed 693 row versions in 12 pages
INFO:  "pg_listener": found 693 removable, 21 nonremovable row versions in 28 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 2308 unused item pointers.
28 pages contain useful free space.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

Numerous notifications took 1000ms or so to complete in the test
script execution between those two vacuum runs.

>If that is the problem then the next question is why it got so much more
>bloated than you were used to --- something wrong with vacuuming
>procedures, perhaps?

I have autovacuum on and using default settings.  I have an explicit
vacuum routine that runs nightly over the whole DB.

-Joel
Joel Stevenson <joelstevenson@mac.com> writes:
>> This sounds a bit like pg_listener has gotten bloated.  Try a "VACUUM
>> VERBOSE pg_listener" (as superuser) and see what it says.

> At the moment (server is inactive):

> pcdb=# VACUUM VERBOSE pg_listener;
> INFO:  vacuuming "pg_catalog.pg_listener"
> INFO:  "pg_listener": removed 1 row versions in 1 pages
> INFO:  "pg_listener": found 1 removable, 21 nonremovable row versions
> in 28 pages

OK, that destroys the table-bloat theory.  Just to make sure, I
pre-populated pg_listener with enough dead rows to make 28 pages,
but I still don't see any slow notifies or noticeable load in vmstat.

That server is not quite "inactive", though.  What are the 21 remaining
pg_listener entries for?  Is it possible that those jobs are having
some impact on the ones run by the test script?

Also, it might be worth enabling log_lock_waits to see if the slow
notifies are due to having to wait on some lock or other.

			regards, tom lane
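For anyone wanting to reproduce this, a sketch of the logging settings involved (both parameters exist in 8.3 and can be set per session by a superuser, or in postgresql.conf followed by a reload; the 300 ms threshold is just an example value, not from the thread):

SET log_lock_waits = on;               -- report lock waits longer than deadlock_timeout
SET log_min_duration_statement = 300;  -- log any statement slower than 300 ms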
At 1:57 PM -0500 2/24/08, Tom Lane wrote:
>Joel Stevenson <joelstevenson@mac.com> writes:
>>> This sounds a bit like pg_listener has gotten bloated.  Try a "VACUUM
>>> VERBOSE pg_listener" (as superuser) and see what it says.
>
>> At the moment (server is inactive):
>
>> pcdb=# VACUUM VERBOSE pg_listener;
>> INFO:  vacuuming "pg_catalog.pg_listener"
>> INFO:  "pg_listener": removed 1 row versions in 1 pages
>> INFO:  "pg_listener": found 1 removable, 21 nonremovable row versions
>> in 28 pages
>
>OK, that destroys the table-bloat theory.  Just to make sure, I
>pre-populated pg_listener with enough dead rows to make 28 pages,
>but I still don't see any slow notifies or noticeable load in vmstat.
>
>That server is not quite "inactive", though.  What are the 21 remaining
>pg_listener entries for?  Is it possible that those jobs are having
>some impact on the ones run by the test script?
>
>Also, it might be worth enabling log_lock_waits to see if the slow
>notifies are due to having to wait on some lock or other.

The other listeners are the application's consumers and producer.  At
the time of the testing they were not active but were alive.

For isolation I've just shut down all listener / notifier processes
that were using the box, vacuumed pg_listener, and run the test script
again.  There were several LISTEN or NOTIFY statements that took longer
than expected to complete (default test script settings of 5 consumers
and a loop of 100):

2008-02-24 23:00:48 PST 7541 LOG:  duration: 514.697 ms  statement: LISTEN to_consumer
2008-02-24 23:00:48 PST 7544 LOG:  duration: 508.790 ms  statement: LISTEN to_consumer
2008-02-24 23:00:48 PST 7543 LOG:  duration: 511.061 ms  statement: LISTEN to_consumer
2008-02-24 23:00:48 PST 7545 LOG:  duration: 506.390 ms  statement: LISTEN to_producer
2008-02-24 23:00:57 PST 7544 LOG:  duration: 400.595 ms  statement: NOTIFY to_producer
2008-02-24 23:00:57 PST 7538 LOG:  duration: 369.018 ms  statement: NOTIFY to_producer
2008-02-24 23:01:03 PST 7544 LOG:  duration: 410.588 ms  statement: NOTIFY to_producer
2008-02-24 23:01:03 PST 7541 LOG:  duration: 300.774 ms  statement: NOTIFY to_producer
2008-02-24 23:01:32 PST 7545 LOG:  duration: 325.380 ms  statement: NOTIFY to_consumer
2008-02-24 23:01:42 PST 7538 LOG:  duration: 349.640 ms  statement: NOTIFY to_producer
2008-02-24 23:01:43 PST 7543 LOG:  duration: 529.700 ms  statement: NOTIFY to_producer

-Joel
>Also, it might be worth enabling log_lock_waits to see if the slow
>notifies are due to having to wait on some lock or other.

Turning on log_lock_waits shows that there is a lot of waiting for
locks on the pg_listener table ala:

process 22791 still waiting for ExclusiveLock on relation 2614 of database 16387 after 992.397 ms
...
process 22791 acquired ExclusiveLock on relation 2614 of database 16387 after 1433.152 ms

deadlock_timeout is left at the default 1s setting.  Though these are
being generated during application activity - running the simulation
script does produce 300ms - 600ms NOTIFY statements but doesn't (at
the moment) trigger a lock_wait log entry.
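One way to confirm, while it is happening, that the waiters are queued on pg_listener (a sketch against the 8.3 catalogs, where pg_stat_activity still exposes procpid and current_query; relation 2614 in the log above is pg_listener's OID):

SELECT l.pid, l.mode, l.granted, a.current_query
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.procpid = l.pid
 WHERE l.relation = 'pg_catalog.pg_listener'::regclass;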
Joel Stevenson <joelstevenson@mac.com> writes:
>> Also, it might be worth enabling log_lock_waits to see if the slow
>> notifies are due to having to wait on some lock or other.

> Turning on log_lock_waits shows that there is a lot of waiting for
> locks on the pg_listener table ala:

Interesting.  The LISTEN/NOTIFY mechanism itself takes ExclusiveLock
on pg_listener, but never for very long at a time (assuming pg_listener
doesn't get horribly bloated, which we know isn't happening for you).

Another thought that comes to mind is that maybe the delays you see
come from these lock acquisitions getting blocked behind autovacuums of
pg_listener.  I did not see that while trying to replicate your problem,
but maybe the issue requires more update load on pg_listener than the
test script can create by itself, or maybe some nondefault autovacuum
setting is needed --- what are you using?

			regards, tom lane
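A quick way to answer that question (a sketch; pg_settings is available in 8.3 and readable by any user):

SELECT name, setting
  FROM pg_settings
 WHERE name LIKE 'autovacuum%';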
At 1:13 PM -0500 2/25/08, Tom Lane wrote:
>Joel Stevenson <joelstevenson@mac.com> writes:
>>> Also, it might be worth enabling log_lock_waits to see if the slow
>>> notifies are due to having to wait on some lock or other.
>
>> Turning on log_lock_waits shows that there is a lot of waiting for
>> locks on the pg_listener table ala:
>
>Interesting.  The LISTEN/NOTIFY mechanism itself takes ExclusiveLock
>on pg_listener, but never for very long at a time (assuming pg_listener
>doesn't get horribly bloated, which we know isn't happening for you).
>
>Another thought that comes to mind is that maybe the delays you see
>come from these lock acquisitions getting blocked behind autovacuums of
>pg_listener.  I did not see that while trying to replicate your problem,
>but maybe the issue requires more update load on pg_listener than the
>test script can create by itself, or maybe some nondefault autovacuum
>setting is needed --- what are you using?

Default autovacuum settings.

I turned on all autovacuum logging and cranked up the test script and
have it fork 25 consumers each running 25 iterations.  At that level
on my machine I can get the lock waiting to exceed the 1s
deadlock_timeout right away but the autovacuum activity on
pg_listener is entirely absent until the end when the forked
consumers are mostly done and disconnected.
Joel Stevenson <joelstevenson@mac.com> writes:
> I turned on all autovacuum logging and cranked up the test script and
> have it fork 25 consumers each running 25 iterations.  At that level
> on my machine I can get the lock waiting to exceed the 1s
> deadlock_timeout right away but the autovacuum activity on
> pg_listener is entirely absent until the end when the forked
> consumers are mostly done and disconnected.

Hmph.  At 25/100 I can get a few complaints about NOTIFY taking more
than 20ms, but it seems to be due to blocking behind autovacuum, as in
this example:

2008-02-25 14:53:41.812 EST 13773 LOG:  automatic vacuum of table "joels.pg_catalog.pg_listener": index scans: 0
	pages: 0 removed, 78 remain
	tuples: 5560 removed, 25 remain
	system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
2008-02-25 14:53:41.850 EST 13773 LOG:  automatic analyze of table "joels.pg_catalog.pg_listener" system usage: CPU 0.00s/0.00u sec elapsed 0.03 sec
2008-02-25 14:53:41.851 EST 13728 LOG:  duration: 29.270 ms  statement: NOTIFY to_producer
2008-02-25 14:53:41.852 EST 13758 LOG:  duration: 22.835 ms  statement: NOTIFY to_producer

It's weird that the behavior is robust for you but I can't make it
happen at all.  Would you show the output of pg_config, as well as
all your nondefault postgresql.conf settings?

			regards, tom lane
At 2:57 PM -0500 2/25/08, Tom Lane wrote:
>It's weird that the behavior is robust for you but I can't make it
>happen at all.  Would you show the output of pg_config, as well as
>all your nondefault postgresql.conf settings?

pg_config:

BINDIR = /usr/local/pgsql/bin
DOCDIR = /usr/local/pgsql/doc
INCLUDEDIR = /usr/local/pgsql/include
PKGINCLUDEDIR = /usr/local/pgsql/include
INCLUDEDIR-SERVER = /usr/local/pgsql/include/server
LIBDIR = /usr/local/pgsql/lib
PKGLIBDIR = /usr/local/pgsql/lib
LOCALEDIR =
MANDIR = /usr/local/pgsql/man
SHAREDIR = /usr/local/pgsql/share
SYSCONFDIR = /usr/local/pgsql/etc
PGXS = /usr/local/pgsql/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = 'CFLAGS=-O2 -pipe' '--with-openssl' '--enable-thread-safety' '--with-includes=/usr/kerberos/include' '--with-perl'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/kerberos/include
CFLAGS = -O2 -pipe -Wall -Wmissing-prototypes -Wpointer-arith -Winline -Wdeclaration-after-statement -fno-strict-aliasing
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,'/usr/local/pgsql/lib'
LDFLAGS_SL =
LIBS = -lpgport -lssl -lcrypto -lz -lreadline -ltermcap -lcrypt -ldl -lm
VERSION = PostgreSQL 8.3.0

Non-default postgresql.conf settings:

max_connections = 80
ssl = on
shared_buffers = 1GB
work_mem = 100MB
maintenance_work_mem = 100MB
max_fsm_pages = 204800
vacuum_cost_delay = 100
wal_buffers = 124kB
wal_writer_delay = 200ms
commit_delay = 100
checkpoint_segments = 6
effective_cache_size = 6GB
At 11:58 PM -0500 2/23/08, Tom Lane wrote:
> > Attached is a perl script that sort of simulates what's going on.
>
>Thanks for the script.  It's not showing any particular problems here,
>though.  With log_min_duration_statement = 10, the only statements that
>(slightly) exceed 10ms are the select count(*) from generate_series(1,
>15000) ones.

I tried the test script on another machine (similar but not identical
to the original machine) running 8.3, and although the notify
performance was *much* better than the original I still see
notifications taking longer than the select count(*) from
generate_series(1, 15000) queries, and also longer than some simple
updates to other tables that are also happening on the server.

duration: 10.030 ms  statement: select count(*) from generate_series(1, 15000)
duration: 224.833 ms  statement: NOTIFY to_producer

Perhaps this shouldn't be made much of, as I'm sure there are many ways
that this could quite naturally happen.

I've been thinking of LISTEN / NOTIFY as one of the least expensive and
therefore speedy ways to get the word out to participating processes
that something has changed (versus using a manually set up signals
table that interested parties update and select from).  Now that I see
a little bit more of what goes on under the hood of this function I see
that it's still basically table-driven and I'll adjust my expectations
accordingly, but I'm still puzzled by the hugely slow notifications
showing up on the original server running the producer / consumer
setup.

With ps I can see some postgres backends with a 'notify interrupt
waiting' command line during the tests - could it be an issue with
signal handling on the original machine - something entirely outside
of PG's control?

Thx,
-Joel
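For contrast, the manually maintained signals table alluded to above might look something like this (purely illustrative names; interested parties poll it instead of blocking on a notification):

-- Writer: record that something changed.
UPDATE signals SET version = version + 1 WHERE name = 'queue';

-- Readers: poll and compare against the last version they saw.
SELECT version FROM signals WHERE name = 'queue';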
Joel Stevenson <joelstevenson@mac.com> writes:
> Now that I see a little bit more of what goes on under the hood of
> this function I see that it's still basically table-driven and I'll
> adjust my expectations accordingly,

Yeah, there's been discussion of replacing the implementation with some
all-in-memory queue kind of setup, but no one's got round to that yet.

> With ps I can see some postgres backends with a 'notify interrupt
> waiting' command line during the tests - could it be an issue with
> signal handling on the original machine - something entirely outside
> of PG's control?

No, that's not unexpected if you have the same notify being delivered
to multiple processes that had been idle.  They'll all get wakened and
try to read pg_listener to see what happened, but since this is a
read-modify-write type of operation it uses an exclusive lock, so only
one can clear its pg_listener entry at a time.  The 'waiting' ones you
are seeing are stacked up behind whichever one has the lock at the
moment.  They shouldn't be waiting for long.

I'm still baffled by why we aren't seeing comparable performance for
the same test case.  What I'm testing on is couple-year-old desktop kit
(dual 2.8GHz Xeon, consumer-grade disk drive) --- I had assumed your
server would be at least as fast as that, but maybe not?

			regards, tom lane
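Roughly what each awakened listener has to do, written out as SQL purely for illustration (the real work happens in C inside the backend, under an ExclusiveLock on pg_listener, which is why the 'waiting' backends queue up behind one another):

-- Find out which of my registrations were signalled...
SELECT relname FROM pg_listener
 WHERE listenerpid = pg_backend_pid() AND notification <> 0;

-- ...and reset them once the client has been told.
UPDATE pg_listener SET notification = 0
 WHERE listenerpid = pg_backend_pid() AND notification <> 0;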
At 12:43 PM -0500 2/26/08, Tom Lane wrote:
>I'm still baffled by why we aren't seeing comparable performance for the
>same test case.  What I'm testing on is couple-year-old desktop kit
>(dual 2.8GHz Xeon, consumer-grade disk drive) --- I had assumed your
>server would be at least as fast as that, but maybe not?

It's a quad-core Xeon 3.0GHz machine, 7200rpm SATA discs in a software
RAID.  It's not a particularly high performance machine in terms of
disc IO - but again the comparative speed of most select-update-commit
queries to plain notifies on the server seems off.

What's really baffling is that there are plenty of other OLTP queries
going in multiple backends simultaneously that don't fall over my
300ms query log threshold, and yet NOTIFY and LISTEN consistently do.
What's more, it looks like it's only happening for registered
listener relnames.

This is while the server processes are alive but inactive:

joels=# \timing
Timing is on.
joels=# select * from pg_listener;
    relname     | listenerpid | notification
----------------+-------------+--------------
 alert_inbound  |       15013 |            0
 alert_inbound  |       13371 |            0
 alert_inbound  |       26856 |            0
 alert_inbound  |       12016 |            0
 alert_inbound  |       26911 |            0
 alert_inbound  |       11956 |            0
 alert_process  |       13365 |            0
 alert_inbound  |       26855 |            0
 alert_inbound  |       12248 |            0
 alert_inbound  |       13367 |            0
 alert_inbound  |       12304 |            0
 alert_inbound  |       32633 |            0
 alert_inbound  |       30979 |            0
 alert_inbound  |       29290 |            0
 alert_inbound  |       30394 |            0
 alert_inbound  |       14490 |            0
 alert_inbound  |       14491 |            0
 alert_inbound  |       14492 |            0
(18 rows)

Time: 0.402 ms
joels=# notify foo;
NOTIFY
Time: 0.244 ms
joels=# notify foo2;
NOTIFY
Time: 0.211 ms
joels=# notify alert_process;
NOTIFY
Time: 34.585 ms
joels=# notify alert_process;
NOTIFY
Time: 45.554 ms
joels=# notify alert_inbound;
NOTIFY
Time: 40.868 ms
joels=# notify alert_inbound;
NOTIFY
Time: 176.309 ms
joels=# notify alert_inbound;
NOTIFY
Time: 36.669 ms
joels=# notify alert_inbound;
NOTIFY
Time: 369.761 ms
joels=# notify alert_inbound;
NOTIFY
Time: 34.449 ms
joels=# notify alert_inbound;
NOTIFY
Time: 121.990 ms
joels=# notify foo3;
NOTIFY
Time: 0.250 ms
joels=# notify foo2;
NOTIFY
Time: 0.175 ms

There are no autovacuum log entries prior to or immediately after the
369ms notify command.

-Joel
Tom Lane wrote:
> read-modify-write type of operation it uses an exclusive lock, so only
> one can clear its pg_listener entry at a time.  The 'waiting' ones you
> are seeing are stacked up behind whichever one has the lock at the
> moment.  They shouldn't be waiting for long.
>
I certainly hadn't expected that to be the implementation technique -
isn't it simply that we need a single flag per worker process and can
set/test-and-clear with atomic operations and then a signal to wake
them up?

Anyway - how hard would it be to install triggers on commit and
rollback?  Then we could write our own mechanisms.

James
Joel Stevenson <joelstevenson@mac.com> writes:
> What's really baffling is that there are plenty of other OLTP queries
> going in multiple backends simultaneously that don't fall over my
> 300ms query log threshold, and yet NOTIFY and LISTEN consistently do.
> What's more, it looks like it's only happening for registered
> listener relnames.

Hmm, that says that it's not a matter of locking on pg_listener, but of
actually applying the row update(s) and/or signaling the recipient(s).
If you're not seeing performance issues for ordinary table-update
operations it's hard to see why pg_listener updates would be any worse,
so that seems to point the finger at the signaling.  Which is just a
matter of a kill(2) and shouldn't be that expensive.

It might be interesting to try strace'ing the whole PG process tree
while these notifies are going on, and seeing if you can identify any
specific kernel calls that seem to take a long time.

			regards, tom lane
James Mansion <james@mansionfamily.plus.com> writes:
> I certainly hadn't expected that to be the implementation technique -
> isn't it simply that we need a single flag per worker process and can
> set/test-and-clear with atomic operations and then a signal to wake
> them up?

Hardly --- how's that going to pass a notify name?  Also, a lot of
people want some payload data in a notify, not just a condition name;
any reimplementation that doesn't address that desire probably won't
get accepted.

There's lots of threads in the -hackers archives about reimplementing
listen/notify in a saner fashion.  Personally I lean towards using
something much like the sinval queue.

			regards, tom lane
Tom Lane wrote:
> Hardly --- how's that going to pass a notify name?  Also, a lot of
> people want some payload data in a notify, not just a condition name;
> any reimplementation that doesn't address that desire probably won't
> get accepted.
>
Ah - forgot about the name.  At least there need be just one instance
of a name record queued per worker, if I'm reading the documentation
right - it suggests that notifications can be folded, with the proviso
that if the process generates a notification and at least one other
process generates a notification then it will get at least (but
possibly only) two events.  Not sure why the PID is there rather than
a couple of flag bits.

You'll always have the danger of overflowing a shm area and needing to
spill: is the signal and then lookup in storage materially quicker
than using the master process to route messages via pipes?  As you
say, you have a lock contention issue, and often the total signal data
volume outstanding for a single back end will be less than will fit in
a kernel's pipe buffer.  The sending processes can track what signals
they've generated in the current transaction, so the master (or signal
distributor) needn't get bombarded with signals from lots of rows
within one transaction.
At 6:01 PM -0500 2/26/08, Tom Lane wrote:
>Hmm, that says that it's not a matter of locking on pg_listener,
>but of actually applying the row update(s) and/or signaling the
>recipient(s).  If you're not seeing performance issues for ordinary
>table-update operations it's hard to see why pg_listener updates would
>be any worse, so that seems to point the finger at the signaling.
>Which is just a matter of a kill(2) and shouldn't be that expensive.
>
>It might be interesting to try strace'ing the whole PG process tree
>while these notifies are going on, and seeing if you can identify
>any specific kernel calls that seem to take a long time.

I ran PG via strace and then ran the test script at 25 consumers
looping 25 times each.  There were no other connections to the
database for this strace run and test.

Digging through the strace file is a bit mind-numbing, but here are
some signs that semop and send are the culprits.

Three misc examples coming from near LISTENs or NOTIFYs:
- - - - - -
7495  18:10:40.251855 <... semop resumed> ) = 0 <1.006149>
7495  18:10:41.325442 <... semop resumed> ) = -1 EINTR (Interrupted system call) <0.299135>
7495  18:10:41.998219 <... semop resumed> ) = 0 <0.603566>

A chunk of log following the action on fd 7 (slow recv on ERESTARTSYS)
and then the slow semop that follows:
- - - - - - - -
7495  18:10:42.576401 send(7, "C\0\0\0\vNOTIFY\0Z\0\0\0\5I", 18, 0 <unfinished ...>
7495  18:10:42.576503 <... send resumed> ) = 18 <0.000070>
7495  18:10:42.576620 recv(7,  <unfinished ...>
7495  18:10:42.873796 <... recv resumed> 0x8331d40, 8192, 0) = ? ERESTARTSYS (To be restarted) <0.297158>
7495  18:10:42.873911 --- SIGUSR2 (User defined signal 2) @ 0 (0) ---
7495  18:10:42.874079 gettimeofday( <unfinished ...>
7495  18:10:42.874198 <... gettimeofday resumed> {1204078242, 874097}, NULL) = 0 <0.000101>
7495  18:10:42.874324 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={1, 0}},  <unfinished ...>
7495  18:10:42.874470 <... setitimer resumed> NULL) = 0 <0.000121>
7495  18:10:42.874604 semop(50495522, 0xbfff9764, 1 <unfinished ...>
7495  18:10:43.678431 <... semop resumed> ) = 0 <0.803809>

A little further on:
- - - - - - - -
7495  18:10:44.905320 <... semop resumed> ) = -1 EINTR (Interrupted system call) <0.998192>

I'm not sure exactly what that means in terms of next steps.  I'll dig
more through the strace file and see if I can find anything else, but
those look to be definite bottlenecks for some reason.

At 2:24 PM -0800 2/26/08, Maurice Aubrey wrote:
>What's the OS/Dist?

Red Hat Enterprise Linux ES release 3 (Taroon Update 8)

Thx,
-Joel