Thread: fixing LISTEN/NOTIFY
Applications that frequently use LISTEN/NOTIFY can suffer from performance problems because of the MVCC bloat created by frequent insertions into pg_listener. A solution to this has been suggested in the past: rewrite LISTEN/NOTIFY to use shared memory rather than system catalogs.

The problem is that there is a static amount of shared memory and a potentially unbounded number of notifications, so we can run out of memory. There are two ways to solve this. We can do as sinval does and clear the shared memory queue, then effectively issue a NOTIFY ALL that awakens all listeners. I don't like this behaviour: it seems ugly to expose an implementation detail (static sizing of shared memory) to applications. While a lot of applications are only using LISTEN/NOTIFY for cache invalidation (and so spurious notifications are just a performance hit), this behaviour still seems unfortunate to me. Using NOTIFY ALL also makes NOTIFY 'msg' far less useful, which is a feature several users have asked for in the past.

I think it would be better to either fail the NOTIFY when there is not enough shared memory to add a new notification to the queue, or have the NOTIFY block until shared memory becomes available (applications could of course implement the latter on top of the former by using savepoints and a loop, either on the client side or in PL/pgSQL). I guess we could add an option to NOTIFY to specify how to handle failures.

A related question is when to add the notification to the shared memory queue. We don't want the notification to fire until the NOTIFY's transaction commits, so one alternative would be to delay appending to the queue until transaction-commit time. However, that would mean we wouldn't notice NOTIFY failure until the end of the transaction, or else that we would block waiting for free space during the transaction-commit process. I think it would be better to add an entry to shared memory during the NOTIFY itself, and stamp that entry with the NOTIFY's toplevel XID. Other backends can read the notification immediately (and once all the backends have seen it, the notification can be removed from the queue). Each backend can use the XID to determine when to "fire" the notification (and if the notifying backend rolls back, they can just discard the notification). This scheme is more expensive when the notifying transaction rolls back, but I don't think that is the common case.

Comments? (I'm still thinking about how to organize the shared memory queue, and whether any of the sinval stuff can be reused...)

-Neil
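To make the queue-entry idea concrete, a minimal sketch of what an XID-stamped entry might look like; the struct and field names are illustrative, not from any actual patch:

    #include "postgres.h"   /* TransactionId, NAMEDATALEN */

    /*
     * Hypothetical shared-memory queue entry for the scheme described
     * above: appended during the NOTIFY itself and stamped with the
     * notifying transaction's toplevel XID.
     */
    typedef struct NotifyQueueEntry
    {
        TransactionId xid;                  /* toplevel XID of the notifying xact */
        char          topic[NAMEDATALEN];   /* name passed to NOTIFY */
        int           backends_unseen;      /* entry is recyclable once this reaches 0 */
    } NotifyQueueEntry;

    /*
     * A listening backend that reads an entry delays "firing" it until
     * the XID is known committed, and simply discards the entry if that
     * transaction aborted instead.
     */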
Neil Conway <neilc@samurai.com> writes:
> [ various ideas about reimplementing LISTEN/NOTIFY ]

I really dislike the idea of pushing notifies out to the shared queue before commit. That essentially turns "forever do notify foo" into a global DOS tool: you can drive everybody else's backend into swap hell along with your own.

The idea of blocking during commit until shmem becomes available might work. There are some issues here about transaction atomicity, though: how do you guarantee that all or none of your notifies get sent? (Actually, supposing that the notifies ought to be sent post-commit, "all" is the only acceptable answer. So maybe you just never give up.)

			regards, tom lane
On Thu, 2005-10-06 at 01:14 -0400, Tom Lane wrote:
> The idea of blocking during commit until shmem becomes available might
> work. There are some issues here about transaction atomicity, though:
> how do you guarantee that all or none of your notifies get sent?
> (Actually, supposing that the notifies ought to be sent post-commit,
> "all" is the only acceptable answer. So maybe you just never give up.)

Yeah, I think that would work. We could also write to shared memory before the commit proper, and embed an XID in the message to allow other backends to determine when/if to fire the notification.

However, I don't really like the idea of blocking the backend for a potentially significant amount of time in a state half-way between "committed" and "ready for the next query".

-Neil
On Thu, Oct 06, 2005 at 01:32:32AM -0400, Neil Conway wrote:
> On Thu, 2005-10-06 at 01:14 -0400, Tom Lane wrote:
> > The idea of blocking during commit until shmem becomes available might
> > work. There are some issues here about transaction atomicity, though:
> > how do you guarantee that all or none of your notifies get sent?
> > (Actually, supposing that the notifies ought to be sent post-commit,
> > "all" is the only acceptable answer. So maybe you just never give up.)
>
> Yeah, I think that would work. We could also write to shared memory
> before the commit proper, and embed an XID in the message to allow other
> backends to determine when/if to fire the notification.
>
> However, I don't really like the idea of blocking the backend for a
> potentially significant amount of time in a state half-way between
> "committed" and "ready for the next query".

I don't like the idea of blocking indefinitely. It means another global DOS tool for anybody trying to NOTIFY: just do a LISTEN and sit there doing nothing.

One idea would be to block for a while, with a timeout. If it expires, the receiving backend(s) have to copy the notification to local memory and let go of the one in shared memory.

--
Alvaro Herrera                         http://www.advogato.org/person/alvherre
"I'm always right, but sometimes I'm more right than other times."
                                                            (Linus Torvalds)
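A rough sketch of that block-with-timeout behavior, assuming listeners can be signalled to drain the queue into backend-local memory (every function name below is hypothetical):

    #include "postgres.h"

    #define NOTIFY_QUEUE_TIMEOUT_MS 1000        /* illustrative value */

    extern bool try_enqueue_notification(void);     /* hypothetical */
    extern void signal_listeners_to_drain(void);    /* hypothetical */

    /* Append a notification, blocking with a timeout when shmem is full. */
    static void
    append_with_timeout(void)
    {
        int waited_ms = 0;

        while (!try_enqueue_notification())
        {
            if (waited_ms >= NOTIFY_QUEUE_TIMEOUT_MS)
            {
                /*
                 * Timeout expired: ask lagging listeners to copy the
                 * pending entries into backend-local memory and release
                 * the shared-memory slots, then keep waiting.
                 */
                signal_listeners_to_drain();
                waited_ms = 0;
            }
            pg_usleep(10 * 1000L);      /* sleep 10 ms */
            waited_ms += 10;
        }
    }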
Neil Conway <neilc@samurai.com> writes:
> However, I don't really like the idea of blocking the backend for a
> potentially significant amount of time in a state half-way between
> "committed" and "ready for the next query".

I wonder whether we could use something comparable to pg_multixact or pg_subtrans, to convert the problem from one of "need to fit in fixed amount of memory" to one of "it's on disk with some buffers in memory".

			regards, tom lane
On Thu, Oct 06, 2005 at 09:12:58AM -0400, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > However, I don't really like the idea of blocking the backend for a
> > potentially significant amount of time in a state half-way between
> > "committed" and "ready for the next query".
>
> I wonder whether we could use something comparable to pg_multixact
> or pg_subtrans, to convert the problem from one of "need to fit in
> fixed amount of memory" to one of "it's on disk with some buffers
> in memory".

The multixact mechanism seems a perfect fit to me (variable-length contents, identifiers produced serially and destroyed in a not-too-distant future). In fact I proposed it a while back.

--
Alvaro Herrera                   http://www.amazon.com/gp/registry/CTMLCN8V17R4
"At least to kernel hackers, who really are human, despite occasional
rumors to the contrary"                                             (LWN.net)
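For context, the shape shared by pg_multixact and pg_subtrans is a logically unbounded array of fixed-size pages kept on disk, with only a few page buffers in shared memory. A very rough sketch of how that could carry notifications (this is not PostgreSQL's actual slru.h API; all names are illustrative):

    #define QUEUE_PAGE_SIZE   8192
    #define NUM_PAGE_BUFFERS  8

    /*
     * Hypothetical paged notification queue.  head and tail are positions
     * in a conceptually unbounded byte stream; only the pages currently
     * being written or read need to sit in shared memory, so capacity is
     * bounded by disk space rather than by shmem sizing.
     */
    typedef struct PagedNotifyQueue
    {
        char buffers[NUM_PAGE_BUFFERS][QUEUE_PAGE_SIZE];  /* in-memory page cache */
        int  buffer_pageno[NUM_PAGE_BUFFERS];  /* which on-disk page each buffer holds */
        long head;      /* position of the next write */
        long tail;      /* oldest position not yet consumed by every listener */
    } PagedNotifyQueue;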
First of all, I'd like to give a name to the thing that a backend listens to. The code talks about "listening to a relation", but as the comments there point out, the name doesn't actually identify a relation. I'm going to call it a topic from now on.

I'd like to point out that we don't really have a notification queue, but a notification flag for each topic. We're not constrained by the number of notifications, but by the number of topics. It might make sense to change the semantics so that we never lose a notification, if we're going to implement NOTIFY 'msg', but that's another discussion.

I've been thinking about the options for a shmem data structure. Replacing pg_listener in the straightforward way would give us an array:

    struct {
        char topicname[NAMEDATALEN];
        int listenerpid;
        int notification;
    } listenertable[max_subscriptions];

where max_subscriptions is the maximum number of active LISTENs.

If we're ready to take the performance hit, we can exploit the fact that it's OK to signal extra backends, as long as those backends can figure out that it was a false alarm. In fact, we can always signal all backends. Exploiting that, we could have:

    struct {
        char topicname[NAMEDATALEN];
        int subscriptions;          /* number of active subscriptions for this topic */
        int notification_counter;   /* increase by 1 on each NOTIFY */
    } listenertable[max_topics];

NOTIFY increases the notification_counter by one, and signals all backends. Every backend keeps a private list of (topicname, notification_counter) pairs for the topics it's subscribed to, in addition to the shared memory table. The signal handler compares the notification_counter in the private list and in the listenertable. If they don't match, notify the client. If they match, it was a false alarm. The shmem requirement of this is

    max_topics * (NAMEDATALEN + sizeof(int) * 2)

If we're not ready to take the performance hit, we can add the list of backends to shmem:

    struct {
        char topicname[NAMEDATALEN];
        int subscriptions;          /* number of active subscriptions for this topic */
        int notification_counter;   /* increase by 1 on each NOTIFY */
        int subscribers[max_backends];
    } listenertable[max_topics];

and only signal those backends that are in the subscribers array. The shmem requirement of this is

    max_topics * (NAMEDATALEN + sizeof(int) * (2 + max_backends))

We can also do a tradeoff between shmem usage and unnecessary signals:

    struct {
        char topicname[NAMEDATALEN];
        int subscriptions;          /* number of active subscriptions for this topic */
        int notification_counter;   /* increase by 1 on each NOTIFY */
        int subscriber_cache[cache_size];
    } listenertable[max_topics];

where cache_size can be anything between 0 and max_backends. The shmem requirement of this is

    max_topics * (NAMEDATALEN + sizeof(int) * (2 + cache_size))

If the cache gets full, NOTIFY signals all backends. Otherwise, only those that are in the cache.

If max_topics is large, a hash table should be used instead of an array of structs.

Now that I think of it, using the notification_counter, we *can* guarantee that no notification is lost. The signal handler just needs to notify the client (shmem notification_counter) - (private notification_counter) times.

- Heikki

On Thu, 6 Oct 2005, Neil Conway wrote:
> [ original proposal quoted in full; see above ]
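A sketch of the false-alarm check in Heikki's counter scheme, reusing his per-topic structure (the delivery function is hypothetical):

    #include "postgres.h"   /* NAMEDATALEN */

    typedef struct TopicSlot        /* one listenertable entry, as above */
    {
        char topicname[NAMEDATALEN];
        int  subscriptions;         /* active subscriptions for this topic */
        int  notification_counter;  /* increased by 1 on each NOTIFY */
    } TopicSlot;

    typedef struct LocalTopic       /* backend-private counterpart */
    {
        char topicname[NAMEDATALEN];
        int  seen_counter;          /* counter value already reported to the client */
    } LocalTopic;

    extern void deliver_notify_to_client(const char *topic);   /* hypothetical */

    static void
    check_topic(TopicSlot *shared, LocalTopic *local)
    {
        /*
         * Deliver one notification per NOTIFY not yet reported; catching
         * up counter-by-counter is what lets no notification be lost.
         * If the counters already match, the signal was a false alarm.
         */
        while (local->seen_counter < shared->notification_counter)
        {
            deliver_notify_to_client(local->topicname);
            local->seen_counter++;
        }
    }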
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> It might make sense to change the semantics so that we never lose a
> notification, if we're going to implement NOTIFY 'msg', but that's
> another discussion.

That's pretty much a given --- the ability to pass some payload data in notifications has been on the TODO list for a very long time. I don't think we're going to reimplement listen/notify without adding it.

I like your suggestion of "topic" for the notify name, and am tempted to go fix the documentation to use that term right now ...

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I like your suggestion of "topic" for the notify name, and am tempted to
> go fix the documentation to use that term right now ...

Fwiw, I think the more conventional word here would be "channel". But whatever.

--
greg
On Thu, Oct 06, 2005 at 10:30:24AM -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > It might make sense to change the semantics so that we never lose a
> > notification, if we're going to implement NOTIFY 'msg', but that's
> > another discussion.
>
> That's pretty much a given --- the ability to pass some payload data in
> notifications has been on the TODO list for a very long time. I don't
> think we're going to reimplement listen/notify without adding it.

Maybe I'm missing something, but is it possible to ensure notifications aren't lost using Heikki's method, since everything's only in shared memory? Or is the idea that stuff would not survive a backend crash?

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > Maybe I'm missing something, but is it possible to ensure notifications > aren't lost using Heikki's method, since everything's only in shared > memory? Or is the idea that stuff would not survive a backend crash? Listen/notify state has never survived a crash (since it is defined in terms of PIDs that will no longer exist after a DB restart), and I don't really see any reason why we'd want it to. An application reconnecting after a DB crash would have to assume it might have missed some notifications occurring before it could reconnect, and would have to re-determine what the database state is anyway. But I think you might be confusing that with the feature-or-bug (depending on one's point of view) that duplicate notifications can be merged into one event. I'm inclined to preserve that behavior, primarily because not doing so would create a tremendous penalty on applications that expect it to work that way. With addition of payload data it'd be easy for apps that don't want merging to prevent it: just add an otherwise-uninteresting serial number to the payload string. We'd certainly want to define the "duplicate" test to consider the payload string as well as the topic name. regards, tom lane
On Sat, Oct 08, 2005 at 04:59:22PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > Maybe I'm missing something, but is it possible to ensure notifications
> > aren't lost using Heikki's method, since everything's only in shared
> > memory? Or is the idea that stuff would not survive a backend crash?
>
> Listen/notify state has never survived a crash (since it is defined in
> terms of PIDs that will no longer exist after a DB restart), and I don't
> really see any reason why we'd want it to. An application reconnecting
> after a DB crash would have to assume it might have missed some
> notifications occurring before it could reconnect, and would have to
> re-determine what the database state is anyway.
>
> But I think you might be confusing that with the feature-or-bug
> (depending on one's point of view) that duplicate notifications can be
> merged into one event. I'm inclined to preserve that behavior,
> primarily because not doing so would create a tremendous penalty on
> applications that expect it to work that way. With addition of payload
> data it'd be easy for apps that don't want merging to prevent it: just
> add an otherwise-uninteresting serial number to the payload string.
> We'd certainly want to define the "duplicate" test to consider the
> payload string as well as the topic name.

I thought the idea behind NOTIFY 'msg' was to enable some form of queuing, or at least something that ensures the notification sticks around until picked up. I see I was wrong. :)

Before ripping out the old code, would it be useful as the basis for a queue/notification system that ensures a message sticks around until at least one listener picks it up? Granted, I think such a thing could be built using the new notification system along with a table, but if the code's already there and useful...

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Tom Lane said:
> But I think you might be confusing that with the feature-or-bug
> (depending on one's point of view) that duplicate notifications can be
> merged into one event. I'm inclined to preserve that behavior,
> primarily because not doing so would create a tremendous penalty on
> applications that expect it to work that way.

What sort of application are you envisioning? If you mean there are applications for which avoiding duplicate notifications is a performance win, I think those applications are on shaky ground: the LISTEN/NOTIFY interface doesn't guarantee that no duplicate notifications will be delivered (it merely doesn't guarantee they *will* be delivered).

Personally, I think delivering all notifications by default is simpler behavior for the application programmer to understand. If you want to avoid doing work for duplicate notifications, you realistically need to implement that yourself anyway.

-Neil
Neil, Jim, All:

> Personally, I think delivering all notifications by default is simpler
> behavior for the application programmer to understand. If you want to
> avoid doing work for duplicate notifications, you realistically need to
> implement that yourself anyway.

I agree with Neil; I don't really see duplicate notifications as a problem, provided that we can deliver a message with the notification. As long as we can deliver a PK id or other unique information, then we can still use NOTIFY to do, for example, asynchronous summary counts.

However, I really dislike the idea of blocking on out-of-memory NOTIFYs. While I can see some DBAs wanting that, for most applications I'd want to implement, I'd rather that the notification were thrown away than have it stop a query. LISTEN/NOTIFY is an asynchronous messaging mechanism, after all -- the whole idea is for it *not* to slow down transactions.

Ideally, we'd have three options which could be set administratively (at startup) in postgresql.conf:

    notify_mem = 16384          # in K
    notify_mem_full = discard   # one of: discard, block, or disk
        # discard = simply drop notifications over the memory limit
        # block   = stop queries from committing until notification memory is open
        # disk    = spill notifications to the pg_notice file

You'll notice that I added a spill-to-disk on NOTIFY; I think that should be a TODO even if we don't implement it immediately. We should also have a way to detect the situation of notify_mem being full, both through log messages and by a command-line function. Finally, we'd need a function for the superuser to clear all notifications.

I'm thinking here of my personal application for a better LISTEN/NOTIFY: using pg_memcached to store fast asynchronous materialized views. The idea would be that you could maintain a materialized view in memory that was no more than 30 seconds behind, by:

    [every 15 seconds]:
        check if there are any notifications on the view
        if so, regenerate the rows indicated in the notify message
        if not, next check if notify_mem is full
        if so, regenerate all matviews from scratch and send clear_notices()

Hopefully people get the idea?

--
Josh Berkus
Aglio Database Solutions
San Francisco
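A sketch of how the three proposed notify_mem_full policies might branch at the point where the queue is found full (every name below is hypothetical; no such GUC exists):

    typedef enum NotifyFullPolicy
    {
        NOTIFY_FULL_DISCARD,        /* drop notifications over the memory limit */
        NOTIFY_FULL_BLOCK,          /* stall until queue space opens up */
        NOTIFY_FULL_DISK            /* spill to the proposed pg_notice file */
    } NotifyFullPolicy;

    extern void log_notification_discarded(void);   /* hypothetical */
    extern void wait_for_queue_space(void);         /* hypothetical */
    extern void spill_notification_to_disk(void);   /* hypothetical */

    /* Hypothetical handler invoked when the shared queue has no room. */
    static void
    handle_queue_full(NotifyFullPolicy policy)
    {
        switch (policy)
        {
            case NOTIFY_FULL_DISCARD:
                log_notification_discarded();   /* drop it, but leave a log trace */
                break;
            case NOTIFY_FULL_BLOCK:
                wait_for_queue_space();         /* the transaction stalls here */
                break;
            case NOTIFY_FULL_DISK:
                spill_notification_to_disk();   /* append to pg_notice */
                break;
        }
    }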
"Neil Conway" <neilc@samurai.com> writes: > Tom Lane said: >> I'm inclined to preserve that behavior, >> primarily because not doing so would create a tremendous penalty on >> applications that expect it to work that way. > What sort of application are you envisioning? The ones that have a per-row trigger that does "NOTIFY foo". In the past this would deliver one event per transaction; changing that to one per row is going to kill them. I'm not very concerned about whether similar events issued by different transactions are merged or not --- as you say, one could never rely on that to happen anyway because of timing. But one event per transaction has been a reliable behavior and I think it would be bad to change it. regards, tom lane