Thread: listen/notify argument (old topic revisited)
A while ago, I started a small discussion about passing arguments to a NOTIFY so that the listening backend could get more information about the event. There wasn't exactly a consensus from what I understand, but the last thing I remember is that someone intended to speed up the notification process by storing the events in shared memory segments (IIRC this was Tom's idea). That would create a remote possibility of a spurious notification, but the idea is that the listening application can check the status and determine what happened. I looked at the TODO, but I couldn't find anything, nor could I find anything in the docs. Is someone still interested in implementing this feature? Are there still people who disagree with the above implementation strategy? Regards, Jeff
Re: listen/notify argument (old topic revisited)
From: nconway@klamath.dyndns.org (Neil Conway)
On Tue, Jul 02, 2002 at 02:37:19AM -0700, Jeff Davis wrote: > A while ago, I started a small discussion about passing arguments to a NOTIFY > so that the listening backend could get more information about the event. Funny, I was just about to post to -hackers about this. > There wasn't exactly a consensus from what I understand, but the last thing I > remember is that someone intended to speed up the notification process by > storing the events in shared memory segments (IIRC this was Tom's idea). That > would create a remote possibility of a spurious notification, but the idea is > that the listening application can check the status and determine what > happened. Yes, that was Tom Lane. IMHO, we need to replace the existing pg_listener scheme with an improved model if we want to make any significant improvements to asynchronous notifications. In summary, the two designs that have been suggested are:

pg_notify: a new system catalog, stores notifications only -- pg_listener stores only listening backends.

shmem: all notifications are done via shared memory and not stored in system catalogs at all, in a manner similar to the cache invalidation code that already exists. This avoids the MVCC-induced performance problem with storing notifications in system catalogs, but can lead to spurious notifications -- the statically sized buffer in which notifications are stored can overflow. Applications will be able to differentiate between overflow-induced and regular messages.

> Is someone still interested in implementing this feature? Are there still > people who disagree with the above implementation strategy? Some people objected to shmem at the time; personally, I'm not really sure which design is best. Any comments from -hackers? If there's a consensus on which route to take, I'll probably implement the preferred design for 7.3. However, I think that a proper implementation of notify messages will need an FE/BE protocol change, so that will need to wait for 7.4. Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC
Jeff Davis wrote: > A while ago, I started a small discussion about passing arguments to a NOTIFY > so that the listening backend could get more information about the event. > > There wasn't exactly a consensus from what I understand, but the last thing I > remember is that someone intended to speed up the notification process by > storing the events in shared memory segments (IIRC this was Tom's idea). That > would create a remote possibility of a spurious notification, but the idea is > that the listening application can check the status and determine what > happened. I don't see a huge value to using shared memory. Once we get auto-vacuum, pg_listener will be fine, and shared memory like SI is just too hard to get working reliably because of all the backends reading/writing in there. We have tables that have the proper sharing semantics; I think we should use those and hope we get autovacuum soon. As far as the message, perhaps passing the oid of the pg_listener row to the backend would help, and then the backend can look up any message for that oid in pg_listener. -- Bruce Momjian | pgman@candle.pha.pa.us | http://candle.pha.pa.us | (610) 853-3000
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Why can't we do efficient indexing, or clear out the table? I don't > > remember. > > I don't recall either, but I do recall that we tried to index it and > backed out the changes. In any case, a table on disk is just plain > not the right medium for transitory-by-design notification messages. OK, I can help here. I added an index on pg_listener so lookups would go faster in the backend, but inserts/updates into the table also require index additions, and your feeling was that the table was small and we would be better without the index and just sequentially scanning the table. I can easily add the index and make sure it is used properly if you are now concerned about table access time. I think your issue was that it is only looked up once, and only updated once, so there wasn't much sense in having that index maintenance overhead, i.e. you only used the index once per row. (I remember the item being on TODO for quite a while when we discussed this.) Of course, a shared memory system probably is going to either do it sequentially or have its own index issues, so I don't see a huge advantage to going to shared memory, and I do see extra code and a queue limit. > >> A curious statement considering that PG depends critically on SI > >> working. This is a solved problem. > > > My point is that SI was buggy for years until we found all the bugs, so > > yea, it is a solved problem, but solved with difficulty. > > The SI message mechanism itself was not the source of bugs, as I recall > it (although certainly the code was incomprehensible in the extreme; > the original programmer had absolutely no grasp of readable coding style > IMHO). The problem was failure to properly design the interactions with > relcache and catcache, which are pretty complex in their own right. > An SI-like NOTIFY mechanism wouldn't have those issues. Oh, OK, interesting. So _that_ was the issue there. -- Bruce Momjian | pgman@candle.pha.pa.us | http://candle.pha.pa.us | (610) 853-3000
Let me tell you what would be really interesting. If we didn't report the pid of the notifying process and we didn't allow arbitrary strings for notify (just pg_class relation names), we could just add a counter to pg_class that is updated for every notify. If a backend is listening, it remembers the counter at listen time, and on every commit checks the pg_class counter to see if it has incremented. That way, there is no queue, no shared memory, and there is no scanning. You just pull up the cache entry for pg_class and look at the counter. One problem is that pg_class would be updated more frequently. Anyway, just an idea. --------------------------------------------------------------------------- Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Is disk i/o a real performance > > penalty for notify, and is performance a huge issue for notify anyway, > > Yes, and yes. I have used NOTIFY in production applications, and I know > that performance is an issue. > > >> The queue limit problem is a valid argument, but it's the only valid > >> complaint IMHO; and it seems a reasonable tradeoff to make for the > >> other advantages. > > BTW, it occurs to me that as long as we make this an independent message > buffer used only for NOTIFY (and *not* try to merge it with SI), we > don't have to put up with overrun-reset behavior. The overrun reset > approach is useful for SI because there are only limited times when > we are prepared to handle SI notification in the backend work cycle. > However, I think a self-contained NOTIFY mechanism could be much more > flexible about when it will remove messages from the shared buffer. > Consider this: > > 1. To send NOTIFY: grab write lock on shared-memory circular buffer. > If enough space, insert message, release lock, send signal, done. > If not enough space, release lock, send signal, sleep some small > amount of time, and then try again. (Hard failure would occur only > if the proposed message size exceeds the buffer size; as long as we > make the buffer size a parameter, this is the DBA's fault not ours.) > > 2. On receipt of signal: grab read lock on shared-memory circular > buffer, copy all data up to write pointer into private memory, > advance my (per-process) read pointer, release lock. This would be > safe to do pretty much anywhere we're allowed to malloc more space, > so it could be done say at the same points where we check for cancel > interrupts. Therefore, the expected time before the shared buffer > is emptied after a signal is pretty small. > > In this design, if someone sits in a transaction for a long time, > there is no risk of shared memory overflow; that backend's private > memory for not-yet-reported NOTIFYs could grow large, but that's > his problem. (We could avoid unnecessary growth by not storing > messages that don't correspond to active LISTENs for that backend. > Indeed, a backend with no active LISTENs could be left out of the > circular buffer participation list altogether.) > > We'd need to separate this processing from the processing that's used to > force SI queue reading (dz's old patch), so we'd need one more signal > code than we use now. But we do have SIGUSR1 available. > > regards, tom lane > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
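For illustration, a rough sketch of the counter idea above; the relnotifycount counter and the helper routines are invented, nothing like them exists in the catalogs today:

/*
 * Hypothetical sketch of the pg_class counter idea -- not actual backend
 * code.  The relnotifycount counter and the helper routines are invented.
 */
typedef struct ListenState
{
    Oid     relid;          /* pg_class OID of the listened-to relation */
    uint32  counter_seen;   /* counter value remembered at LISTEN time */
} ListenState;

/* NOTIFY would simply bump the (hypothetical) counter for the relation */
static void
do_notify(Oid relid)
{
    bump_relnotifycount(relid);         /* invented catalog-update helper */
}

/* At commit (or end of query), a listener compares its cached counter */
static bool
notify_pending(ListenState *ls)
{
    uint32  current = get_relnotifycount(ls->relid);    /* invented syscache lookup */

    if (current != ls->counter_seen)
    {
        ls->counter_seen = current;
        return true;        /* someone has notified since we last looked */
    }
    return false;
}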
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I don't see a huge value to using shared memory. Once we get > auto-vacuum, pg_listener will be fine, No it won't. The performance of notify is *always* going to suck as long as it depends on going through a table. This is particularly true given the lack of any effective way to index pg_listener; the more notifications you feed through, the more dead rows there are with the same key... > and shared memory like SI is just > too hard to get working reliabily because of all the backends > reading/writing in there. A curious statement considering that PG depends critically on SI working. This is a solved problem. regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Why can't we do efficient indexing, or clear out the table? I don't > remember. I don't recall either, but I do recall that we tried to index it and backed out the changes. In any case, a table on disk is just plain not the right medium for transitory-by-design notification messages. >> A curious statement considering that PG depends critically on SI >> working. This is a solved problem. > My point is that SI was buggy for years until we found all the bugs, so > yea, it is a solved problem, but solved with difficulty. The SI message mechanism itself was not the source of bugs, as I recall it (although certainly the code was incomprehensible in the extreme; the original programmer had absolutely no grasp of readable coding style IMHO). The problem was failure to properly design the interactions with relcache and catcache, which are pretty complex in their own right. An SI-like NOTIFY mechanism wouldn't have those issues. regards, tom lane
Jeff Davis wrote: > On Tuesday 02 July 2002 06:03 pm, Bruce Momjian wrote: > > Let me tell you what would be really interesting. If we didn't report > > the pid of the notifying process and we didn't allow arbitrary strings > > for notify (just pg_class relation names), we could just add a counter > > to pg_class that is updated for every notify. If a backend is > > listening, it remembers the counter at listen time, and on every commit > > checks the pg_class counter to see if it has incremented. That way, > > there is no queue, no shared memory, and there is no scanning. You just > > pull up the cache entry for pg_class and look at the counter. > > > > One problem is that pg_class would be updated more frequently. Anyway, > > just an idea. > > I think that currently a lot of people use select() (after all, it's mentioned > in the docs) in the frontend to determine when a notify comes into a > listening backend. If the backend only checks on commit, and the backend is > largely idle except for notify processing, might it be a while before the > frontend realizes that a notify was sent? I meant for it to check exactly when it does now: when a query completes. -- Bruce Momjian | pgman@candle.pha.pa.us | http://candle.pha.pa.us | (610) 853-3000
On Tuesday 02 July 2002 06:03 pm, Bruce Momjian wrote: > Let me tell you what would be really interesting. If we didn't report > the pid of the notifying process and we didn't allow arbitrary strings > for notify (just pg_class relation names), we could just add a counter > to pg_class that is updated for every notify. If a backend is > listening, it remembers the counter at listen time, and on every commit > checks the pg_class counter to see if it has incremented. That way, > there is no queue, no shared memory, and there is no scanning. You just > pull up the cache entry for pg_class and look at the counter. > > One problem is that pg_class would be updated more frequently. Anyway, > just an idea. I think that currently a lot of people use select() (after all, it's mentioned in the docs) in the frontend to determine when a notify comes into a listening backend. If the backend only checks on commit, and the backend is largely idle except for notify processing, might it be a while before the frontend realizes that a notify was sent? Regards,Jeff > > --------------------------------------------------------------------------- > > Tom Lane wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Is disk i/o a real performance > > > penalty for notify, and is performance a huge issue for notify anyway, > > > > Yes, and yes. I have used NOTIFY in production applications, and I know > > that performance is an issue. > > > > >> The queue limit problem is a valid argument, but it's the only valid > > >> complaint IMHO; and it seems a reasonable tradeoff to make for the > > >> other advantages. > > > > BTW, it occurs to me that as long as we make this an independent message > > buffer used only for NOTIFY (and *not* try to merge it with SI), we > > don't have to put up with overrun-reset behavior. The overrun reset > > approach is useful for SI because there are only limited times when > > we are prepared to handle SI notification in the backend work cycle. > > However, I think a self-contained NOTIFY mechanism could be much more > > flexible about when it will remove messages from the shared buffer. > > Consider this: > > > > 1. To send NOTIFY: grab write lock on shared-memory circular buffer. > > If enough space, insert message, release lock, send signal, done. > > If not enough space, release lock, send signal, sleep some small > > amount of time, and then try again. (Hard failure would occur only > > if the proposed message size exceeds the buffer size; as long as we > > make the buffer size a parameter, this is the DBA's fault not ours.) > > > > 2. On receipt of signal: grab read lock on shared-memory circular > > buffer, copy all data up to write pointer into private memory, > > advance my (per-process) read pointer, release lock. This would be > > safe to do pretty much anywhere we're allowed to malloc more space, > > so it could be done say at the same points where we check for cancel > > interrupts. Therefore, the expected time before the shared buffer > > is emptied after a signal is pretty small. > > > > In this design, if someone sits in a transaction for a long time, > > there is no risk of shared memory overflow; that backend's private > > memory for not-yet-reported NOTIFYs could grow large, but that's > > his problem. (We could avoid unnecessary growth by not storing > > messages that don't correspond to active LISTENs for that backend. > > Indeed, a backend with no active LISTENs could be left out of the > > circular buffer participation list altogether.) 
> > > > We'd need to separate this processing from the processing that's used to > > force SI queue reading (dz's old patch), so we'd need one more signal > > code than we use now. But we do have SIGUSR1 available. > > > > regards, tom lane
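The select() approach Jeff refers to is the usual libpq pattern: after issuing LISTEN, block on the connection's socket and drain PQnotifies() whenever data arrives. A minimal sketch (error handling trimmed, and the connection is assumed to have already executed LISTEN):

#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include "libpq-fe.h"

/* Wait for asynchronous notifications on an already-LISTENing connection */
static void
wait_for_notifies(PGconn *conn)
{
    int     sock = PQsocket(conn);

    for (;;)
    {
        fd_set      input_mask;
        PGnotify   *notify;

        FD_ZERO(&input_mask);
        FD_SET(sock, &input_mask);

        /* sleep until something arrives on the connection's socket */
        if (select(sock + 1, &input_mask, NULL, NULL, NULL) < 0)
        {
            perror("select");
            return;
        }

        /* read whatever arrived, then drain any pending notifications */
        PQconsumeInput(conn);
        while ((notify = PQnotifies(conn)) != NULL)
        {
            printf("NOTIFY \"%s\" received from backend pid %d\n",
                   notify->relname, notify->be_pid);
            free(notify);   /* later releases use PQfreemem() instead */
        }
    }
}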
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Of course, a shared memory system probably is going to either do it > sequentailly or have its own index issues, so I don't see a huge > advantage to going to shared memory, and I do see extra code and a queue > limit. Disk I/O vs. no disk I/O isn't a huge advantage? Come now. A shared memory system would use sequential (well, actually circular-buffer) access, which is *exactly* what you want given the inherently sequential nature of the messages. The reason that table storage hurts is that we are forced to do searches, which we could eliminate if we had control of the storage ordering. Again, it comes down to the fact that tables don't provide the right abstraction for this purpose. The "extra code" argument doesn't impress me either; async.c is currently 900 lines, about 2.5 times the size of sinvaladt.c which is the guts of SI message passing. I think it's a good bet that a SI-like notify module would be much smaller than async.c is now; it's certainly unlikely to be significantly larger. The queue limit problem is a valid argument, but it's the only valid complaint IMHO; and it seems a reasonable tradeoff to make for the other advantages. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I don't see a huge value to using shared memory. Once we get > > auto-vacuum, pg_listener will be fine, > > No it won't. The performance of notify is *always* going to suck > as long as it depends on going through a table. This is particularly > true given the lack of any effective way to index pg_listener; the > more notifications you feed through, the more dead rows there are > with the same key... Why can't we do efficient indexing, or clear out the table? I don't remember. > > and shared memory like SI is just > > too hard to get working reliably because of all the backends > > reading/writing in there. > > A curious statement considering that PG depends critically on SI > working. This is a solved problem. My point is that SI was buggy for years until we found all the bugs, so yea, it is a solved problem, but solved with difficulty. Do we want to add another SI-type capability that could be as difficult to get working properly, or will the notify piggyback on the existing SI code? If the latter, that would be fine with me, but we still have the overflow queue problem. -- Bruce Momjian | pgman@candle.pha.pa.us | http://candle.pha.pa.us | (610) 853-3000
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Of course, a shared memory system probably is going to either do it > > sequentially or have its own index issues, so I don't see a huge > > advantage to going to shared memory, and I do see extra code and a queue > > limit. > > Disk I/O vs. no disk I/O isn't a huge advantage? Come now. My assumption is that it throws to disk as backing store, which seems better to me than dropping the notifies. Is disk i/o a real performance penalty for notify, and is performance a huge issue for notify anyway, assuming autovacuum? > A shared memory system would use sequential (well, actually > circular-buffer) access, which is *exactly* what you want given > the inherently sequential nature of the messages. The reason that > table storage hurts is that we are forced to do searches, which we > could eliminate if we had control of the storage ordering. Again, > it comes down to the fact that tables don't provide the right > abstraction for this purpose. To me, it just seems like going to shared memory is taking our existing table structure and moving it to memory. Yea, there is no tuple header, and yea we can make a circular list, but we can't index the thing, so is spinning around a circular list any better than a sequential scan of a table? Yea, we can delete stuff better, but autovacuum would help with that. It just seems like we are reinventing the wheel. Are there other uses for this? Can we make use of RAM-only tables? > The "extra code" argument doesn't impress me either; async.c is > currently 900 lines, about 2.5 times the size of sinvaladt.c which is > the guts of SI message passing. I think it's a good bet that a SI-like > notify module would be much smaller than async.c is now; it's certainly > unlikely to be significantly larger. > > The queue limit problem is a valid argument, but it's the only valid > complaint IMHO; and it seems a reasonable tradeoff to make for the > other advantages. I am just not excited about it. What do others think? -- Bruce Momjian | pgman@candle.pha.pa.us | http://candle.pha.pa.us | (610) 853-3000
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Is disk i/o a real performance > penalty for notify, and is performance a huge issue for notify anyway, Yes, and yes. I have used NOTIFY in production applications, and I know that performance is an issue. >> The queue limit problem is a valid argument, but it's the only valid >> complaint IMHO; and it seems a reasonable tradeoff to make for the >> other advantages. BTW, it occurs to me that as long as we make this an independent message buffer used only for NOTIFY (and *not* try to merge it with SI), we don't have to put up with overrun-reset behavior. The overrun reset approach is useful for SI because there are only limited times when we are prepared to handle SI notification in the backend work cycle. However, I think a self-contained NOTIFY mechanism could be much more flexible about when it will remove messages from the shared buffer. Consider this:

1. To send NOTIFY: grab write lock on shared-memory circular buffer. If enough space, insert message, release lock, send signal, done. If not enough space, release lock, send signal, sleep some small amount of time, and then try again. (Hard failure would occur only if the proposed message size exceeds the buffer size; as long as we make the buffer size a parameter, this is the DBA's fault not ours.)

2. On receipt of signal: grab read lock on shared-memory circular buffer, copy all data up to write pointer into private memory, advance my (per-process) read pointer, release lock. This would be safe to do pretty much anywhere we're allowed to malloc more space, so it could be done say at the same points where we check for cancel interrupts. Therefore, the expected time before the shared buffer is emptied after a signal is pretty small.

In this design, if someone sits in a transaction for a long time, there is no risk of shared memory overflow; that backend's private memory for not-yet-reported NOTIFYs could grow large, but that's his problem. (We could avoid unnecessary growth by not storing messages that don't correspond to active LISTENs for that backend. Indeed, a backend with no active LISTENs could be left out of the circular buffer participation list altogether.)

We'd need to separate this processing from the processing that's used to force SI queue reading (dz's old patch), so we'd need one more signal code than we use now. But we do have SIGUSR1 available. regards, tom lane
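For concreteness, the send side (step 1) might look roughly like the sketch below. The LWLock calls are the real backend primitives; NotifyBuffer, NotifyLock, free_space(), copy_into_ring(), signal_listening_backends(), and short_sleep() are invented for illustration:

/* Invented shared-memory layout; offsets are logical positions in a ring */
typedef struct NotifyBuffer
{
    int     size;           /* total usable bytes in the ring */
    int     write_ptr;      /* where the next message goes */
    /* per-backend read pointers live elsewhere in shared memory */
    char    data[1];        /* actually 'size' bytes */
} NotifyBuffer;

static void
send_notify(NotifyBuffer *buf, const char *msg, int len)
{
    if (len > buf->size)
        elog(ERROR, "notify message larger than notify buffer");   /* DBA's fault */

    for (;;)
    {
        LWLockAcquire(NotifyLock, LW_EXCLUSIVE);
        if (free_space(buf) >= len)             /* invented helper */
        {
            copy_into_ring(buf, msg, len);      /* advances write_ptr */
            LWLockRelease(NotifyLock);
            signal_listening_backends();        /* the extra signal code */
            return;
        }

        /* no room: wake the readers so they drain the buffer, then retry */
        LWLockRelease(NotifyLock);
        signal_listening_backends();
        short_sleep();      /* "sleep some small amount of time" */
    }
}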
> Of course, a shared memory system probably is going to either do it > sequentailly or have its own index issues, so I don't see a huge > advantage to going to shared memory, and I do see extra code and a queue > limit. Is a shared memory implementation going to play silly buggers with the Win32 port? Chris
On Wed, 2002-07-03 at 08:20, Christopher Kings-Lynne wrote: > > Of course, a shared memory system probably is going to either do it > > sequentailly or have its own index issues, so I don't see a huge > > advantage to going to shared memory, and I do see extra code and a queue > > limit. > > Is a shared memory implementation going to play silly buggers with the Win32 > port? Perhaps this is a good place to introduce anonymous mmap ? Is there a way to grow anonymous mmap on demand ? ---------------- Hannu
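For reference, "anonymous mmap" here means a file-less shared mapping created before the backends fork, roughly as below; note that MAP_ANONYMOUS (MAP_ANON on some platforms) is itself not universally available, which is part of the portability question:

#include <sys/mman.h>
#include <stddef.h>

/* Create a shared, anonymous mapping (inherited across fork) */
static void *
create_anon_shared_region(size_t size)
{
    void   *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    return (p == MAP_FAILED) ? NULL : p;
}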
On Tue, 2002-07-02 at 23:35, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Is disk i/o a real performance > > penalty for notify, and is performance a huge issue for notify anyway, > > Yes, and yes. I have used NOTIFY in production applications, and I know > that performance is an issue. > > >> The queue limit problem is a valid argument, but it's the only valid > >> complaint IMHO; and it seems a reasonable tradeoff to make for the > >> other advantages. > > BTW, it occurs to me that as long as we make this an independent message > buffer used only for NOTIFY (and *not* try to merge it with SI), we > don't have to put up with overrun-reset behavior. The overrun reset > approach is useful for SI because there are only limited times when > we are prepared to handle SI notification in the backend work cycle. > However, I think a self-contained NOTIFY mechanism could be much more > flexible about when it will remove messages from the shared buffer. > Consider this: > > 1. To send NOTIFY: grab write lock on shared-memory circular buffer. Are you planning to have one circular buffer per listening backend? Would that not be a waste of space for a large number of backends with long notify arguments? -------------- Hannu
On Tue, 2002-07-02 at 17:12, Bruce Momjian wrote: > Tom Lane wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Of course, a shared memory system probably is going to either do it > > > sequentailly or have its own index issues, so I don't see a huge > > > advantage to going to shared memory, and I do see extra code and a queue > > > limit. > > > > Disk I/O vs. no disk I/O isn't a huge advantage? Come now. > > My assumption is that it throws to disk as backing store, which seems > better to me than dropping the notifies. Is disk i/o a real performance > penalty for notify, and is performance a huge issue for notify anyway, > assuming autovacuum? For me, performance would be one of the only concerns. Currently I use two methods of finding changes, one is NOTIFY which directs frontends to reload various sections of data, the second is a table which holds a QUEUE of actions to be completed (which must be tracked, logged and completed). If performance wasn't a concern, I'd simply use more RULES which insert requests into my queue table.
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes: > Is a shared memory implementation going to play silly buggers with the Win32 > port? No. Certainly no more so than shared disk buffers or the SI message facility, both of which are *not* optional. regards, tom lane
Hannu Krosing <hannu@tm.ee> writes: > Perhaps this is a good place to introduce anonymous mmap ? I don't think so; it just adds a portability variable without buying us anything. > Is there a way to grow anonymous mmap on demand ? Nope. Not portably, anyway. For instance, the HPUX man page for mmap sayeth: "If the size of the mapped file changes after the call to mmap(), the effect of references to portions of the mapped region that correspond to added or removed portions of the file is unspecified." Dynamically re-mmapping after enlarging the file might work, but there are all sorts of interesting constraints on that too; it looks like you'd have to somehow synchronize things so that all the backends do it at the exact same time. On the whole I see no advantage to be gained here, compared to the implementation I sketched earlier with a fixed-size shared buffer and enlargeable internal buffers in backends. regards, tom lane
Hannu Krosing <hannu@tm.ee> writes: > Are you planning to have one circular buffer per listening backend ? No; one circular buffer, period. Each backend would also internally buffer notifies that it hadn't yet delivered to its client --- but since the time until delivery could vary drastically across clients, I think that's reasonable. I'd expect clients that are using LISTEN to avoid doing long-running transactions, so under normal circumstances the internal buffer should not grow very large. regards, tom lane
On Wed, 2002-07-03 at 15:51, Tom Lane wrote: > Hannu Krosing <hannu@tm.ee> writes: > > Are you planning to have one circular buffer per listening backend ? > > No; one circular buffer, period. > > Each backend would also internally buffer notifies that it hadn't yet > delivered to its client --- but since the time until delivery could vary > drastically across clients, I think that's reasonable. I'd expect > clients that are using LISTEN to avoid doing long-running transactions, > so under normal circumstances the internal buffer should not grow very > large. > > regards, tom lane > 2. On receipt of signal: grab read lock on shared-memory circular > buffer, copy all data up to write pointer into private memory, > advance my (per-process) read pointer, release lock. This would be > safe to do pretty much anywhere we're allowed to malloc more space, > so it could be done say at the same points where we check for cancel > interrupts. Therefore, the expected time before the shared buffer > is emptied after a signal is pretty small. > > In this design, if someone sits in a transaction for a long time, > there is no risk of shared memory overflow; that backend's private > memory for not-yet-reported NOTIFYs could grow large, but that's > his problem. (We could avoid unnecessary growth by not storing > messages that don't correspond to active LISTENs for that backend. > Indeed, a backend with no active LISTENs could be left out of the > circular buffer participation list altogether.) There could be a little more smartness here to avoid unnecessary copying (not just storing) of not-listened-to data. Perhaps each notify message could be stored as (ptr_to_next_blk, name, data) so that the receiving backend could skip uninteresting (not-listened-to) messages. I guess that depending on the circumstances this can be either faster or slower than copying them all in one memmove. This will be slower if all messages are interesting, but an overall win if there is one backend listening to messages with a big data load and lots of other backends listening to relatively small messages. There are scenarios where some more complex structure will be faster (a sparse communication structure, say 1000 backends each listening to 1 name and notifying ten others - each backend has to (manually ;) check 1000 messages to find the one that is for it) but your proposed structure seems good enough for most common uses (and definitely better than the current one) --------------------- Hannu
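A possible entry layout for the skippable-message idea above (the struct is hypothetical; NAMEDATALEN is the backend's identifier-length constant):

/* Hypothetical entry format for the (ptr_to_next_blk, name, data) idea */
typedef struct NotifyEntry
{
    int     next_offset;        /* ring offset of the next entry */
    char    name[NAMEDATALEN];  /* the NOTIFY condition name */
    int     data_len;           /* length of the payload that follows */
    /* 'data_len' payload bytes follow the header; a reader that is not
     * listening on 'name' jumps straight to next_offset without copying */
} NotifyEntry;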
Hannu Krosing <hannu@tm.ee> writes: > There could a little more smartness here to avoid unneccessary copying > (not just storing) of not-listened-to data. Yeah, I was wondering about that too. > I guess that depending on the circumstances this can be either faster or > slower than copying them all in one memmove. The more interesting question is whether it's better to hold the read lock on the shared buffer for the minimum possible amount of time; if so, we'd be better off to pull the data from the buffer as quickly as possible and then sort it later. Determining whether we are interested in a particular notify name will probably take a probe into a (local) hashtable, so it won't be super-quick. However, I think we could arrange for readers to use a sharable lock on the buffer, so having them expend that processing while holding the read lock might be acceptable. My guess is that the actual volume of data going through the notify mechanism isn't going to be all that large, and so avoiding one memcpy step for it isn't going to be all that exciting. I think I'd lean towards minimizing the time spent holding the shared lock, instead. But it's a judgment call. regards, tom lane
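The "copy quickly under the shared lock, filter afterwards" approach might look roughly like this; the buffer, lock, and hash-table names are invented, and ring wraparound is ignored:

static void
absorb_notifies(NotifyBuffer *buf, int *my_read_ptr)
{
    char   *local;
    int     avail;

    /* hold the shared lock only long enough to copy the raw bytes */
    LWLockAcquire(NotifyLock, LW_SHARED);
    avail = buf->write_ptr - *my_read_ptr;      /* wraparound ignored here */
    local = palloc(avail);
    memcpy(local, buf->data + *my_read_ptr, avail);
    *my_read_ptr = buf->write_ptr;
    LWLockRelease(NotifyLock);

    /* outside the lock: probe the backend-local hash of LISTENed names
     * and keep only the messages this backend actually cares about */
    filter_against_listen_hash(local, avail);   /* invented helper */
}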
On Wed, 2002-07-03 at 16:30, Tom Lane wrote: > Hannu Krosing <hannu@tm.ee> writes: > > There could be a little more smartness here to avoid unnecessary copying > > (not just storing) of not-listened-to data. > > Yeah, I was wondering about that too. > > > I guess that depending on the circumstances this can be either faster or > > slower than copying them all in one memmove. > > The more interesting question is whether it's better to hold the read > lock on the shared buffer for the minimum possible amount of time; OTOH, we may decide that getting a notify ASAP is not a priority: if we can't get the lock, just go on doing what we were doing and try again the next time around. This may have some pathological behaviours (starving some backends who always come late ;), but we are already attracting a thundering herd by sending a signal to all _possibly_ interested backends at the same time. Keeping a list of who listens to what can solve this problem (but only in case of sparse listening habits). ----------------- Hannu
On Wed, 2002-07-03 at 16:30, Tom Lane wrote: > My guess is that the actual volume of data going through the notify > mechanism isn't going to be all that large, and so avoiding one memcpy > step for it isn't going to be all that exciting. It may become large if we have an implementation which can cope well with large volumes :) > I think I'd lean towards minimizing the time spent holding the > shared lock, instead. In case you are waiting for just one message out of 1000 it may still be faster to do selective copying. It is possible that 1000 strcmp's + 1000 pointer traversals are faster than one big memcpy, no? > But it's a judgment call. If we have a clean C interface + separate PG binding we may write several different modules for different scenarios and let the user choose (even at startup time) - code optimized for messages that everybody wants is bound to be suboptimal for the case when they only want 1 out of 1000 messages. Same for different message sizes. ------------- Hannu
Re: listen/notify argument (old topic revisited)
From: nconway@klamath.dyndns.org (Neil Conway)
On Tue, Jul 02, 2002 at 05:35:42PM -0400, Tom Lane wrote: > 1. To send NOTIFY: grab write lock on shared-memory circular buffer. > If enough space, insert message, release lock, send signal, done. > If not enough space, release lock, send signal, sleep some small > amount of time, and then try again. (Hard failure would occur only > if the proposed message size exceeds the buffer size; as long as we > make the buffer size a parameter, this is the DBA's fault not ours.) How would this interact with the current transactional behavior of NOTIFY? At the moment, executing a NOTIFY command only stores the pending notification in a List in the backend you're connected to; when the current transaction commits, the NOTIFY is actually processed (stored in pg_listener, SIGUSR2 sent, etc) -- if the transaction is rolled back, the NOTIFY isn't sent. If we do the actual insertion when the NOTIFY is executed, I don't see a simple way to get this behavior... Cheers, Neil -- Neil Conway <neilconway@rogers.com> PGP Key ID: DB3C29FC
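A simplified sketch of the transactional behavior Neil describes (the real code lives in src/backend/commands/async.c; the names and list handling here are simplified, and publish_notify() stands in for whatever happens at commit, whether that is today's pg_listener update or a shared-buffer insert):

#include <stdio.h>
#include <stdlib.h>

typedef struct PendingNotify
{
    struct PendingNotify *next;
    char    relname[NAMEDATALEN];
} PendingNotify;

static PendingNotify *pending = NULL;   /* backend-local, per transaction */

/* Executing the NOTIFY command only remembers the name */
static void
queue_notify(const char *relname)
{
    PendingNotify *n = malloc(sizeof(PendingNotify));

    snprintf(n->relname, NAMEDATALEN, "%s", relname);
    n->next = pending;
    pending = n;
}

/* At commit the saved notifications are actually sent */
static void
at_commit_notify(void)
{
    PendingNotify *n;

    for (n = pending; n != NULL; n = n->next)
        publish_notify(n->relname);     /* invented stand-in */
    pending = NULL;                     /* freeing omitted in this sketch */
}

/* On abort they are simply thrown away, so nothing is ever delivered */
static void
at_abort_notify(void)
{
    pending = NULL;
}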
Hannu Krosing <hannu@tm.ee> writes: > but we are already attracting a thundering herd by > sending a signal to all _possibly_ interested backends at the same time That's why it's so important that the readers use a sharable lock. The only thing they'd be locking out is some new writer trying to send (yet another) notify. Also, it's a pretty important optimization to avoid signaling backends that are not listening for any notifies at all. We could improve on it further by keeping info in shared memory about which backends are listening for which notify names, but I don't see any good way to do that in a fixed amount of space. regards, tom lane
nconway@klamath.dyndns.org (Neil Conway) writes: > How would this interact with the current transactional behavior of > NOTIFY? No change. Senders would only insert notify messages into the shared buffer when they commit (uncommitted notifies would live in a list in the sender, same as now). Readers would be expected to remove messages from the shared buffer ASAP after receiving the signal, but they'd store those messages internally and not forward them to the client until such time as they're not inside a transaction block. regards, tom lane
On Wed, 2002-07-03 at 17:48, Tom Lane wrote: > Hannu Krosing <hannu@tm.ee> writes: > > but we are already attracting a thundering herd by > > sending a signal to all _possibly_ interested backends at the same time > > That's why it's so important that the readers use a sharable lock. The > only thing they'd be locking out is some new writer trying to send (yet > another) notify. But there must be some way to communicate the positions of read pointers of all backends for managing the free space, lest we are unable to know when the buffer is full. I imagined that at least this info was kept in shared memory. > Also, it's a pretty important optimization to avoid signaling backends > that are not listening for any notifies at all. But that is of little help when they are all listening to something ;) > We could improve on it further by keeping info in shared memory about > which backends are listening for which notify names, but I don't see > any good way to do that in a fixed amount of space. A compromise would be to do it for some fixed amount of memory (say 10 names/backend) and assume "all" if out of that memory. Notifying everybody has fewer bad effects when backends listen to more names, and keeping lists is pure overhead when all listeners listen to all names. -------------- Hannu
Hannu Krosing <hannu@tm.ee> writes: > On Wed, 2002-07-03 at 17:48, Tom Lane wrote: >> That's why it's so important that the readers use a sharable lock. The >> only thing they'd be locking out is some new writer trying to send (yet >> another) notify. > But there must be some way to communicate the positions of read pointers > of all backends for managing the free space, lest we are unable to know > when the buffer is full. Right. But we play similar games already with the existing SI buffer, to wit:

Writers grab the controlling lock LW_EXCLUSIVE, thereby having sole access; in this state it's safe for them to examine all the read pointers as well as examine/update the write pointer (and of course write data into the buffer itself). The furthest-back read pointer limits what they can write.

Readers grab the controlling lock LW_SHARED, thereby ensuring there is no writer (but there may be other readers). In this state they may examine the write pointer (to see how much data there is) and may examine and update their own read pointer. This is safe and useful because no reader cares about any other's read pointer.

>> We could improve on it further by keeping info in shared memory about >> which backends are listening for which notify names, but I don't see >> any good way to do that in a fixed amount of space. > A compromise would be to do it for some fixed amount of memory (say 10 > names/backend) and assume "all" if out of that memory. I thought of that too, but it's not clear how much it'd help. The writer would have to scan through all the per-reader data while holding the write lock, which is not good for concurrency. On SMP hardware it could actually be a net loss. Might be worth experimenting with though. You could make a good reduction in the shared-memory space needed by storing just a hash code for the interesting names, and not the names themselves. (I'd also be inclined to include the hash code in the transmitted message, so that readers could more quickly ignore uninteresting messages.) regards, tom lane
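In code, the writer-side free-space check might look something like this (names invented; pointers are treated as logical, ever-increasing positions, with the modulo arithmetic left out):

#define MAX_LISTENERS 64    /* placeholder; the real limit would track MaxBackends */

typedef struct NotifyShmem
{
    int     write_ptr;                  /* logical position of next write */
    int     num_readers;
    int     read_ptr[MAX_LISTENERS];    /* one slot per listening backend */
} NotifyShmem;

/* Caller must hold the controlling lock in LW_EXCLUSIVE mode */
static int
writable_space(NotifyShmem *shm, int buf_size)
{
    int     oldest = shm->write_ptr;
    int     i;

    /* the furthest-back read pointer limits what we can overwrite */
    for (i = 0; i < shm->num_readers; i++)
    {
        if (shm->read_ptr[i] < oldest)
            oldest = shm->read_ptr[i];
    }

    return buf_size - (shm->write_ptr - oldest);
}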
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> themselves. (I'd also be inclined to include the hash code in the > >> transmitted message, so that readers could more quickly ignore > >> uninteresting messages.) > > > Doesn't seem worth it, and how would the user know their hash; > > This is not the user's problem; it is the writing backend's > responsibility to compute and add the hash. Basically we trade off some > space to compute the hash code once at the writer not N times at all the > readers. Oh, OK. When you said "transmitted", I thought you meant transmitted to the client. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Tom Lane wrote: > themselves. (I'd also be inclined to include the hash code in the > transmitted message, so that readers could more quickly ignore > uninteresting messages.) Doesn't seem worth it, and how would the user know their hash; they already have a C string for comparison. Do we have to handle possible hash collisions? -- Bruce Momjian | pgman@candle.pha.pa.us | http://candle.pha.pa.us | (610) 853-3000
On Wed, 2002-07-03 at 22:43, Tom Lane wrote: > Hannu Krosing <hannu@tm.ee> writes: > > On Wed, 2002-07-03 at 17:48, Tom Lane wrote: > >> That's why it's so important that the readers use a sharable lock. The > >> only thing they'd be locking out is some new writer trying to send (yet > >> another) notify. > > > But there must be some way to communicate the positions of read pointers > > of all backends for managing the free space, lest we are unable to know > > when the buffer is full. > > Right. But we play similar games already with the existing SI buffer, > to wit: > > Writers grab the controlling lock LW_EXCLUSIVE, thereby having sole > access; in this state it's safe for them to examine all the read > pointers as well as examine/update the write pointer (and of course > write data into the buffer itself). The furthest-back read pointer > limits what they can write. It means a full seq scan over pointers ;) > Readers grab the controlling lock LW_SHARED, thereby ensuring there > is no writer (but there may be other readers). In this state they > may examine the write pointer (to see how much data there is) and > may examine and update their own read pointer. This is safe and > useful because no reader cares about any other's read pointer. OK. Now, how will we introduce transactional behaviour to this scheme ? It is easy to save transaction id with each notify message, but is there a quick way for backends to learn when these transactions commit/abort or if they have done either in the past ? Is there already a good common facility for that, or do I just need to examine some random tuples in hope of finding out ;) -------------- Hannu
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> themselves. (I'd also be inclined to include the hash code in the >> transmitted message, so that readers could more quickly ignore >> uninteresting messages.) > Doesn't seem worth it, and how would the user know their hash; This is not the user's problem; it is the writing backend's responsibility to compute and add the hash. Basically we trade off some space to compute the hash code once at the writer not N times at all the readers. regards, tom lane
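Concretely, the trade-off might look like the sketch below, which is also one answer to the collision question raised earlier: the hash is only a cheap pre-filter, and a matching hash is still confirmed against the string itself (the message struct is invented and the choice of hash function is left open):

typedef struct NotifyMsg
{
    uint32  name_hash;              /* computed once by the sending backend */
    char    name[NAMEDATALEN];      /* the NOTIFY condition name */
} NotifyMsg;

/* Reader side: cheap hash comparison first, strcmp only on a hash match */
static bool
interested_in(const NotifyMsg *msg, const char *listen_name, uint32 listen_hash)
{
    if (msg->name_hash != listen_hash)
        return false;               /* most uninteresting messages stop here */

    /* hashes can collide, so the name itself is still the final word */
    return strcmp(msg->name, listen_name) == 0;
}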
Hannu Krosing <hannu@tm.ee> writes: >> Right. But we play similar games already with the existing SI buffer, >> to wit: > It means a full seq scan over pointers ;) I have not seen any indication that the corresponding scan in the SI code is a bottleneck --- and that has to scan over *all* backends, without even the opportunity to skip those that aren't LISTENing. > OK. Now, how will we introduce transactional behaviour to this scheme ? It's no different from before --- notify messages don't get into the buffer at all, until they're committed. See my earlier response to Neil. regards, tom lane