Thread: Listen / Notify rewrite
Hi,

Attached is a patch for a new listen/notify implementation.

In a few words, the patch reimplements listen/notify as an slru-based queue which works similarly to the sinval structure. Essentially it is a ring buffer on disk with pages mapped into shared memory for read/write access.

Additionally the patch does the following (see below for details):

1. It removes the pg_listener relation and
2. adds the possibility to specify a payload parameter, i.e. executing in SQL "NOTIFY foo 'payload';" and 'payload' will be delivered to any listening backend.
3. Every distinct notification is delivered.
4. Order is preserved: if txn 1 first does NOTIFY foo and then NOTIFY bar, a backend (listening to both "foo" and "bar") will always receive the notification "foo" before the notification "bar".
5. It's now "listen to a channel", not "listen to a relation" anymore...

Details:

1. Instead of placing the queue into shared memory only, I propose to create a new subdirectory pg_notify/ and make the queue slru-based, so that we do not risk blocking. Several people here have pointed out that blocking is a true no-go for a new listen/notify implementation. With an slru-based queue we have so much space that blocking is really unlikely, even in periods with extreme notify bursts.

Regarding performance, the slru queue is not fsync-ed to disk, so most activity would happen in the OS file cache anyway, and most backends will probably work on the same pages most of the time. However, more locking overhead is required in comparison to a shared-memory-only implementation.

There is one doubt that I have: currently the patch adds notifications to the queue after the transaction has committed to clog. The advantage is that we do not need to take care of visibility: when we add notifications to the queue, we have already committed our transaction, and all reading backends are not in a transaction anyway, so everything is visible to everyone and we can just write to and read from the queue. However, if for some reason we cannot write to the slru files in the pg_notify/ directory, we might want to roll back the current transaction, but with the proposed patch we cannot, because we have already committed... But... if there is a problem with the pg_notify/ directory, then something is fundamentally wrong on the file system of the database server, and pg_subtrans/ and pg_clog/ are probably affected by the same problem...

One possible solution would be to write to the queue before committing, adding the TransactionID. Then other backends can check whether our TransactionID has committed successfully or not. Not sure if this is worth the overhead however...

2. The payload parameter is optional. A notifying client can call either "NOTIFY foo;" or "NOTIFY foo 'payload';". The length of the payload is currently limited to 128 characters... Not sure if we should allow longer payload strings... If there is more complex data to be transferred, the sending transaction can always put all of that data into a relation and just send the id of the entry. If no payload is specified, this is treated internally like an empty payload; consequently an empty string will be delivered as the payload to the listening backend.

3. Not every notification is delivered, only distinct notifications (per transaction). In other words, each sending transaction eliminates duplicate notifications that have the exact same payload and channel name as another notification that is already in the queue to be sent out.
Should we have an upper limit on the number of notifications that a transaction is allowed to send? Without an upper limit, a client can open a transaction and send a series of NOTIFYs, each with a different payload, until its backend runs out of memory...

4. Example:

Sending transaction does:        This will be delivered (on commit):

    NOTIFY foo;
    NOTIFY foo;                  NOTIFY foo;
    NOTIFY bar 'value1';         NOTIFY bar 'value1';
    NOTIFY bar 'value2';
    NOTIFY foo;                  NOTIFY bar 'value2';
    NOTIFY bar 'value1';

Note that we do _not_ guarantee that notifications from transaction 1 are always delivered before the notifications of transaction 2 just because transaction 1 committed before transaction 2.

Let me know what you think,

Regards,
Joachim
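For context, here is a minimal sketch of the client side of such a proposal using libpq. It assumes the payload would arrive in the existing extra field of PGnotify (an assumption; the patch's protocol changes are not shown in this thread), and the connection string and channel name are examples only:

    #include <stdio.h>
    #include <sys/select.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn   *conn = PQconnectdb("dbname=postgres");
        PGnotify *n;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }
        PQclear(PQexec(conn, "LISTEN foo;"));

        for (;;)
        {
            int    sock = PQsocket(conn);
            fd_set mask;

            /* block until the backend sends us something */
            FD_ZERO(&mask);
            FD_SET(sock, &mask);
            if (select(sock + 1, &mask, NULL, NULL, NULL) < 0)
                break;

            PQconsumeInput(conn);
            while ((n = PQnotifies(conn)) != NULL)
            {
                printf("channel \"%s\" from pid %d, payload \"%s\"\n",
                       n->relname, n->be_pid, n->extra ? n->extra : "");
                PQfreemem(n);
            }
        }
        PQfinish(conn);
        return 0;
    }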
On Wed, Nov 11, 2009 at 10:25:05PM +0100, Joachim Wieland wrote:
> Hi,
>
> Attached is a patch for a new listen/notify implementation.
>
> In a few words, the patch reimplements listen/notify as an slru-based
> queue which works similarly to the sinval structure. Essentially it is
> a ring buffer on disk with pages mapped into shared memory for
> read/write access.

While I can't really comment on the implementation, from your description it looks like a big improvement. Nice work!

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
On Nov 11, 2009, at 4:25 PM, Joachim Wieland wrote:
> Hi,
>
> Attached is a patch for a new listen/notify implementation.
>
> In a few words, the patch reimplements listen/notify as an slru-based
> queue which works similarly to the sinval structure. Essentially it is
> a ring buffer on disk with pages mapped into shared memory for
> read/write access.
>
> Additionally the patch does the following (see below for details):
>
> 1. It removes the pg_listener relation and
> 2. adds the possibility to specify a payload parameter, i.e. executing
> in SQL "NOTIFY foo 'payload';" and 'payload' will be delivered to any
> listening backend.
> 3. Every distinct notification is delivered.
> 4. Order is preserved: if txn 1 first does NOTIFY foo and then NOTIFY
> bar, a backend (listening to both "foo" and "bar") will always receive
> the notification "foo" before the notification "bar".
> 5. It's now "listen to a channel", not "listen to a relation" anymore...

Hi Joachim,

Thank you for implementing this- LISTEN/NOTIFY without a payload has been a major problem to work around for me.

I understand that coalescing multiple notifications of the same name happens now, but I never understood why. I had hoped that someone revisiting this "feature" would remove it. My use case is autonomous transactions. From a developer's standpoint, NOTIFY specifies a form of remote trigger for further action- outside the transaction- to occur. From that point of view, I don't see why NOTIFY foo; NOTIFY foo; is equivalent to NOTIFY foo;.

I understand the use case where a per-row trigger could generate lots of spurious notifications, but it seems that if anything is generating spurious notifications, the notification is in the wrong place. The documentation makes the strong implication that notification names are usually table names, but with the new payload, this becomes even less the case.

"I changed this table, take a look at it to see what's new." But no such association is enforced by the NOTIFY and LISTEN commands.
http://www.postgresql.org/docs/8.4/static/sql-notify.html

With the coalescing, I feel like I am being punished for using NOTIFY with something other than a table name use case. At least with this new payload, I can set the payload to the transaction ID and be certain that all the notifications I sent are processed (and in order even!), but could you explain why the coalescing is still necessary?

While coalescing could be achieved on the receiving-end NOTIFY callback ( if(payload ID was already processed) continue; ), non-coalescing behavior cannot be achieved when the backend does the coalescing. For backwards compatibility, payload-less NOTIFY could coalesce while NOTIFYs with a payload would not coalesce, but, on the other hand, that might be confusing.

In any case, thank you for improving this long-neglected subsystem!

Best regards,
M
>> 2. adds the possibility to specify a payload parameter, i.e. executing
>> in SQL "NOTIFY foo 'payload';" and 'payload' will be delivered to any
>> listening backend.
>
> Thank you for implementing this- LISTEN/NOTIFY without a payload has
> been a major problem to work around for me.

+1

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
+ /*
+  * This function is executed for every notification found in the queue in order
+  * to check if the current backend is listening on that channel. Not sure if we
+  * should further optimize this, for example convert to a sorted array and
+  * allow binary search on it...
+  */
+ static bool
+ IsListeningOn(const char *channel)

I think a bsearch would be needed. Very busy servers that make heavy use of notifies would be quite a penalty.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
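For illustration, the sorted-array variant being suggested could look roughly like this (a sketch only; the array, its length variable, and how LISTEN/UNLISTEN would maintain them in sorted order are hypothetical, not the patch's actual structures):

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical: channel names this backend listens on, kept sorted
     * by LISTEN/UNLISTEN so that lookups can use bsearch(). */
    static char **listenChannels;
    static int    numListenChannels;

    static int
    compare_channel(const void *a, const void *b)
    {
        return strcmp(*(const char *const *) a, *(const char *const *) b);
    }

    static bool
    IsListeningOn(const char *channel)
    {
        return bsearch(&channel, listenChannels, numListenChannels,
                       sizeof(char *), compare_channel) != NULL;
    }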
On Thu, Nov 12, 2009 at 1:04 AM, Andrew Chernow <ac@esilo.com> wrote:
> I think a bsearch would be needed. Very busy servers that make heavy
> use of notifies would be quite a penalty.

In such an environment, how many relations/channels is a backend typically listening to? Do you have average / maximal numbers?

Regards,
Joachim
Joachim Wieland wrote:
> On Thu, Nov 12, 2009 at 1:04 AM, Andrew Chernow <ac@esilo.com> wrote:
>> I think a bsearch would be needed. Very busy servers that make heavy
>> use of notifies would be quite a penalty.
>
> In such an environment, how many relations/channels is a backend
> typically listening to? Do you have average / maximal numbers?

We have a system where many libpq clients, ~2000 - 4000 per server (depends on hardware), maintain a persistent backend connection. Each connection listens for notifies, LISTEN 'client_xxx'. There are maybe 10 different reasons why a NOTIFY 'client_xxx' is fired. Sometimes notifies are broadcast to all client connections, or to just portions of them. The goal is real-time messaging to large groups of computers/devices. Past 4000, the problem is distributed to a second, third, etc... server.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
Andrew Chernow <ac@esilo.com> writes:
> + * This function is executed for every notification found in the queue in order
> + * to check if the current backend is listening on that channel. Not sure if we
> + * should further optimize this, for example convert to a sorted array and
> + * allow binary search on it...

> I think a bsearch would be needed. Very busy servers that make heavy
> use of notifies would be quite a penalty.

Premature optimization is the root of all evil ;-). Unless you've done some profiling and can show that this is a hot spot, making it more complicated isn't the thing to be doing now.

			regards, tom lane
>>>>> "Martijn" == Martijn van Oosterhout <kleptog@svana.org> writes:

 >> Hi,
 >>
 >> Attached is a patch for a new listen/notify implementation.
 >>
 >> In a few words, the patch reimplements listen/notify as an
 >> slru-based queue which works similarly to the sinval
 >> structure. Essentially it is a ring buffer on disk with pages
 >> mapped into shared memory for read/write access.

 Martijn> While I can't really comment on the implementation, from your
 Martijn> description it looks like a big improvement.

Does it cope with the case where a trigger is doing NOTIFY, and you do a whole-table update, therefore dumping potentially millions of notifications in at once?

(for example a rare maintenance operation on a table which has a listen/notify arrangement triggered by single inserts or updates)

The existing implementation copes with that just fine.

--
Andrew (irc:RhodiumToad)
> Premature optimization is the root of all evil ;-). Unless you've done
> some profiling and can show that this is a hot spot, making it more
> complicated isn't the thing to be doing now.

I'm thinking of how our system uses/abuses notifies, and began wondering whether several thousand backends listening with a large queue would perform decently behind a linear search. At this point, I have no data either way; only an assumption based on being burnt by sequential scans in the past ;)

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
On Wed, Nov 11, 2009 at 5:48 PM, A.M. <agentm@themactionfaction.com> wrote:
> At least with this new payload, I can set the payload to the transaction
> ID and be certain that all the notifications I sent are processed (and
> in order even!), but could you explain why the coalescing is still
> necessary?

Christmas comes early this year! :-).

Three reasons:
*) it works that way now...a lot of people use this feature for all kinds of subtle things and the behavior should change as little as possible
*) legacy issues aside, I think it's generally better behavior (how many times do you need to be tapped on the shoulder?)
*) since you can trivially differentiate it (using xid, sequence, etc), what's the fuss?

merlin
On Nov 11, 2009, at 9:28 PM, Merlin Moncure wrote:
> On Wed, Nov 11, 2009 at 5:48 PM, A.M. <agentm@themactionfaction.com> wrote:
>> At least with this new payload, I can set the payload to the transaction
>> ID and be certain that all the notifications I sent are processed (and
>> in order even!), but could you explain why the coalescing is still
>> necessary?
>
> Christmas comes early this year! :-).
>
> Three reasons:
> *) it works that way now...a lot of people use this feature for all
> kinds of subtle things and the behavior should change as little as
> possible
> *) legacy issues aside, I think it's generally better behavior (how
> many times do you need to be tapped on the shoulder?)
> *) since you can trivially differentiate it (using xid, sequence,
> etc), what's the fuss?

Except for the fact that the number of times a notification occurred may be valuable information.

I thought of a compromise: add the number of times a notification was generated (coalesced count+1) to the callback data. That would satisfy any backwards compatibility concerns and my use case too!

Cheers,
M
Joachim Wieland <joe@mcknight.de> writes:
> However, if for some reason we cannot write to the slru files in the
> pg_notify/ directory, we might want to roll back the current transaction,
> but with the proposed patch we cannot, because we have already committed...

I think this is a deal-breaker, and arguing about how the pg_notify directory ought to be writable is not even slightly acceptable --- out of disk space is an obvious risk. The existing implementation guarantees that notifications obey ACID: if you commit, they are sent, and if you don't, they aren't. You can't just put in something that works most of the time instead.

> One possible solution would be to write to the queue before committing,
> adding the TransactionID. Then other backends can check whether our
> TransactionID has committed successfully or not. Not sure if this is
> worth the overhead however...

That sounds reasonable, and it's certainly no more overhead than the tuple visibility checks that happen in the current implementation.

> 2. The payload parameter is optional. A notifying client can call either
> "NOTIFY foo;" or "NOTIFY foo 'payload';". The length of the payload is
> currently limited to 128 characters... Not sure if we should allow
> longer payload strings...

Might be a good idea to make the max the same as the max length for prepared transaction GUIDs? Not sure anyone would be shipping those around, but it's a pre-existing limit of about the same size.

			regards, tom lane
> I thought of a compromise: add the number of times a notification was
> generated (coalesced count+1) to the callback data. That would satisfy
> any backwards compatibility concerns and my use case too!

If you are suggesting that the server poke data into the notifier's opaque payload, I vote no. Maybe the NOTIFY command can include a switch to enable this behavior. No syntax suggestions at this point.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
Andrew Chernow <ac@esilo.com> writes:
>> I thought of a compromise: add the number of times a notification was
>> generated (coalesced count+1) to the callback data. That would satisfy
>> any backwards compatibility concerns and my use case too!

> If you are suggesting that the server poke data into the notifier's opaque
> payload, I vote no. Maybe the NOTIFY command can include a switch to enable
> this behavior. No syntax suggestions at this point.

I agree, we should not have the system modifying the payload string for this. And don't bother suggesting a third column in the result --- we'd have to change the FE/BE protocol for that, and it's not going to happen.

The existing precedent is that the system collapses identical notifications without payloads. So we could possibly get away with saying that identical payload-less notifies are collapsed but those with a payload are not. That doesn't really seem to satisfy the POLA though. I think Joachim's definition is fine, and anyone who needs delivery of distinct notifications can easily make his payload strings unique to ensure it.

			regards, tom lane
>> 2. The payload parameter is optional. A notifying client can call either
>> "NOTIFY foo;" or "NOTIFY foo 'payload';". The length of the payload is
>> currently limited to 128 characters... Not sure if we should allow
>> longer payload strings...
>
> Might be a good idea to make the max the same as the max length for
> prepared transaction GUIDs? Not sure anyone would be shipping those
> around, but it's a pre-existing limit of about the same size.

I don't see any reason to impose such a small limit on payload size. Surely some limit must exist, but 128 characters seems awfully small. I already have a few places in mind that would require more bytes. Why not 8K, 64K, 256K, 1M or even more? Is there some other factor in play forcing this limitation?

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
On Nov 11, 2009, at 10:43 PM, Tom Lane wrote:
> Andrew Chernow <ac@esilo.com> writes:
>>> I thought of a compromise: add the number of times a notification was
>>> generated (coalesced count+1) to the callback data. That would satisfy
>>> any backwards compatibility concerns and my use case too!
>
>> If you are suggesting that the server poke data into the notifier's
>> opaque payload, I vote no. Maybe the NOTIFY command can include a
>> switch to enable this behavior. No syntax suggestions at this point.
>
> I agree, we should not have the system modifying the payload string for
> this. And don't bother suggesting a third column in the result ---
> we'd have to change the FE/BE protocol for that, and it's not going
> to happen.
>
> The existing precedent is that the system collapses identical
> notifications without payloads. So we could possibly get away with
> saying that identical payload-less notifies are collapsed but those
> with a payload are not. That doesn't really seem to satisfy the
> POLA though. I think Joachim's definition is fine, and anyone who
> needs delivery of distinct notifications can easily make his payload
> strings unique to ensure it.

The notification count could be a secondary "payload" which does not affect the first, but I guess I'm the only one complaining about the coalescing...

-M
On Thu, Nov 12, 2009 at 2:12 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> Does it cope with the case where a trigger is doing NOTIFY, and you do
> a whole-table update, therefore dumping potentially millions of
> notifications in at once?
>
> (for example a rare maintenance operation on a table which has a
> listen/notify arrangement triggered by single inserts or updates)
>
> The existing implementation copes with that just fine.

As long as you use the existing way to send out notifications by just sending "NOTIFY foo", then yes, it will cope with millions of them just fine, because it will collapse them into one notification as the current implementation does (contrary to the current implementation, a notification will be received from every transaction that has sent one - the current implementation even collapses notifications from different transactions). However, if you decide to use the new syntax and add a distinct payload to every NOTIFY, then you also send out millions of notifications...

With the proposed patch the queue has space for up to about 81,787,680 notifications. The limits are mainly imposed by slru.c, which defines the page size, the pages per segment and the number of segments available. We could easily bump that up to a multiple of the current limits if we decide that we need to...

Joachim
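For what it's worth, the 81,787,680 figure is consistent with one reading of those slru limits (my arithmetic, not taken from the patch): an 8 kB page holding 39 fixed-size entries (roughly 210 bytes per entry, which matches a 128-byte payload plus channel name and header), 32 pages per segment, and 65,535 usable segments:

    65,535 segments x 32 pages/segment x 39 entries/page = 81,787,680 notifications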
On Thu, Nov 12, 2009 at 4:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> One possible solution would be to write to the queue before committing,
>> adding the TransactionID. Then other backends can check whether our
>> TransactionID has committed successfully or not. Not sure if this is
>> worth the overhead however...
>
> That sounds reasonable, and it's certainly no more overhead than the
> tuple visibility checks that happen in the current implementation.

I am not too concerned about the runtime of the visibility checks; instead I suppose that most of the overhead will come from waiting for another transaction to either commit or abort...

If transaction t1 scans the queue and at some point finds notifications from t2, it will ask for the status of t2. If it finds out that t2 is still running, it has no option other than to stop working on the queue and wait (it will be signalled again later once t2 has finished).

This also means that we cannot write notifications to the queue and read from it at the same time, and if a transaction places a few million notifications into the queue, readers need to wait until it has finished, and only after that can they continue and read the notifications...

And it means that if the queue is full, we might run into a deadlock... A transaction adding notifications will wait for the readers to proceed and the readers wait for the transaction to commit or abort...

One option could be to write new notifications to a subdirectory and create a bunch of new segment files there. Once this is done, the segment files could be moved over and renamed so that they continue the slru queue... If we run out of disk space while filling that temporary subdirectory, then we can just delete the subdirectory and nobody has been blocked. We could still run into errors moving and renaming the segment files (e.g. permission problems), so we still might need to abort the transaction...

>> 2. The payload parameter is optional. A notifying client can call either
>> "NOTIFY foo;" or "NOTIFY foo 'payload';". The length of the payload is
>> currently limited to 128 characters... Not sure if we should allow
>> longer payload strings...
>
> Might be a good idea to make the max the same as the max length for
> prepared transaction GUIDs? Not sure anyone would be shipping those
> around, but it's a pre-existing limit of about the same size.

Yes, sounds reasonable to have the same limit for user-defined identifiers...

Joachim
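To make the pre-commit-write variant concrete, the reader-side check under discussion might look roughly like this. TransactionIdIsInProgress() and TransactionIdDidCommit() are the existing backend primitives; the entry type and the delivery routine are hypothetical:

    #include "postgres.h"
    #include "access/transam.h"      /* TransactionIdDidCommit() */
    #include "storage/procarray.h"   /* TransactionIdIsInProgress() */

    typedef struct AsyncQueueEntry AsyncQueueEntry;             /* hypothetical */
    static void DeliverNotification(const AsyncQueueEntry *qe); /* hypothetical */

    /*
     * Sketch: decide what to do with a queue entry whose notifying
     * transaction wrote it before committing.  Returns false if the
     * reader must stop and wait for the sender to finish.
     */
    static bool
    ProcessQueueEntry(const AsyncQueueEntry *qe, TransactionId xid)
    {
        if (TransactionIdIsInProgress(xid))
            return false;            /* sender still open: stop and wait */

        if (TransactionIdDidCommit(xid))
            DeliverNotification(qe); /* committed: deliver the entry */

        /* otherwise the sender aborted: silently skip the entry */
        return true;
    }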
>>> 2. The payload parameter is optional. A notifying client can call either
>>> "NOTIFY foo;" or "NOTIFY foo 'payload';". The length of the payload is
>>> currently limited to 128 characters... Not sure if we should allow
>>> longer payload strings...
>>
>> Might be a good idea to make the max the same as the max length for
>> prepared transaction GUIDs? Not sure anyone would be shipping those
>> around, but it's a pre-existing limit of about the same size.
>
> Yes, sounds reasonable to have the same limit for user-defined identifiers...

[..begging..] Can this be increased significantly? I don't get it, is there any technical reason to make the limit so small? This drastically reduces the usefulness of the payload. I've wanted this feature for quite some time, and it is quite disappointing that I could not even use it because it is unjustifiably limited.

One use case I need is making the payload an absolute path, which saves us a round trip (commonly internet latency) and a query in a section of the system that's extremely performance sensitive. That sure ain't going to fit in 128 bytes.

I'm sure I'm not the only one who finds this limit too small. I can almost guarantee complaints would come in if it were released that way.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
On Thu, Nov 12, 2009 at 8:25 AM, Andrew Chernow <ac@esilo.com> wrote:
> [..begging..] Can this be increased significantly? I don't get it, is
> there any technical reason to make the limit so small? This drastically
> reduces the usefulness of the payload. I've wanted this feature for
> quite some time, and it is quite disappointing that I could not even use
> it because it is unjustifiably limited.

+1

What advantage is there in limiting it to a tiny size? This is a 'payload' after all...an arbitrary data block. Looking at the patch I noticed the payload structure (AsyncQueueEntry) is fixed length and designed to lay into QUEUE_PAGESIZE (set to BLCKSZ) sized pages.

Couple of questions:

*) is BLCKSZ a hard requirement, that is, coming from the slru implementation, or can QUEUE_PAGESIZE be bumped independently of block size?

*) why not make AsyncQueueEntry divide evenly into BLCKSZ, that is, make the whole structure a size that is a power of two? (this would make the payload length 'weird')

*) is there any downside you see to making the AsyncQueueEntry structure exactly BLCKSZ bytes in size? Are we worried about the downsides of spinning the notifications out to disk?

*) Is a variable length AsyncQueueEntry possible? (presumably bounded by the max page size). Or does that complicate the implementation too much?

merlin
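For reference, the kind of fixed-length entry being discussed might look like this (a hypothetical reconstruction from the thread; the actual struct in the patch may have different fields):

    #include "postgres.h"   /* Oid, TransactionId, NAMEDATALEN */

    #define NOTIFY_PAYLOAD_MAX_LENGTH 128   /* the limit under discussion */

    /*
     * Hypothetical fixed-length queue entry.  At roughly 200 bytes per
     * entry, an 8 kB slru page holds about 40 of them, and the payload
     * space is consumed whether or not a payload is actually used.
     */
    typedef struct AsyncQueueEntry
    {
        Oid           dboid;    /* database the NOTIFY was issued in */
        TransactionId xid;      /* sender, if entries are written pre-commit */
        char          channel[NAMEDATALEN];
        char          payload[NOTIFY_PAYLOAD_MAX_LENGTH + 1];
    } AsyncQueueEntry;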
> What advantage is there in limiting it to a tiny size? This is a
> 'payload' after all...an arbitrary data block. Looking at the patch I
> noticed the payload structure (AsyncQueueEntry) is fixed length and
> designed to lay into QUEUE_PAGESIZE (set to BLCKSZ) sized pages.

Hmmmm. Looks like the limitation comes from slru. The true payload limit is (8K - struct members) the way this is implemented.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
On Thu, Nov 12, 2009 at 3:09 PM, Andrew Chernow <ac@esilo.com> wrote:
>> What advantage is there in limiting it to a tiny size? This is a
>> 'payload' after all...an arbitrary data block. Looking at the patch I
>> noticed the payload structure (AsyncQueueEntry) is fixed length and
>> designed to lay into QUEUE_PAGESIZE (set to BLCKSZ) sized pages.
>
> Hmmmm. Looks like the limitation comes from slru. The true payload limit
> is (8K - struct members) the way this is implemented.

So I'm beginning to think that adding a "payload" to NOTIFY is a bad idea altogether.

The original use case NOTIFY was designed to handle was to implement condition variables. You have clients wait on some data structure, and whenever someone changes that data structure they notify all waiting clients to reread the data structure and check if their condition is true. The condition might be that the data structure doesn't match their locally cached copy and they have to rebuild their cache, it might be that some work queue is non-empty, etc.

So the way to build a queue would be to use a table to hold your work queue and then use NOTIFY whenever you insert into the queue, and any idle workers would know to go dequeue something from the queue.

It sounds like what people are trying to do is use NOTIFY as a queue directly. The problem with this is that it was specifically not designed to do this. As evidenced by the fact that it can't store arbitrary data structures (nothing else in postgres is limited to only handling text data, this would be an annoying arbitrary limitation), can't handle even moderately large data efficiently, etc.

I'm beginning to think the right solution is to reject the idea of adding a payload to the NOTIFY mechanism and instead provide a queue contrib module which provides the interface people want.

--
greg
On Thu, Nov 12, 2009 at 3:30 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Couple of questions:
>
> *) is BLCKSZ a hard requirement, that is, coming from the slru
> implementation, or can QUEUE_PAGESIZE be bumped independently of block
> size?

It's the size of slru's pages.

> *) why not make AsyncQueueEntry divide evenly into BLCKSZ, that is,
> make the whole structure a size that is a power of two? (this would
> make the payload length 'weird')

It's possible, and yes, it would make the payload length weird. Also, if we wanted to add further structure members later on (commit time, xid, notify counter, whatever), we'd have to announce that a payload in a new release cannot have the same length as in a previous one, so I tried to keep both independent right from the beginning.

> *) is there any downside you see to making the AsyncQueueEntry
> structure exactly BLCKSZ bytes in size? Are we worried about the
> downsides of spinning the notifications out to disk?

Yes, and also we'd need a lot more page slots to work efficiently, and I fear quite some queue-managing overhead (locking). Currently we are using 4 page slots and can have ~160 notifications mapped into memory.

> *) Is a variable length AsyncQueueEntry possible? (presumably bounded
> by the max page size). Or does that complicate the implementation too
> much?

That's an option to look at. I think it's technically possible; we'd just need to throw in some more logic, but I didn't want to invest too much effort before we have a clear way to go.

However I share Greg's concerns that people are trying to use NOTIFY as a message queue, which it is not designed to be.

Joachim
--On 12. November 2009 15:48:44 +0000 Greg Stark <gsstark@mit.edu> wrote:
> I'm beginning to think the right solution is to reject the idea of
> adding a payload to the NOTIFY mechanism and instead provide a queue
> contrib module which provides the interface people want.

Isn't PgQ already the solution you have in mind?

--
Thanks
Bernd
> However I share Greg's concerns that people are trying to use NOTIFY
> as a message queue, which it is not designed to be.

When you have an established libpq connection waiting for notifies, it is not unreasonable to expect/desire a payload. ISTM the problem is that the initial design was half-baked. NOTIFY is event-driven, ie. no polling!

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
Joachim Wieland <joe@mcknight.de> writes:
> However I share Greg's concerns that people are trying to use NOTIFY
> as a message queue, which it is not designed to be.

Yes. Particularly those complaining that they want to have very large payload strings --- that's pretty much a dead giveaway that it's not being used as a condition signal.

Now you might say that yeah, that's the point, we're trying to enable using NOTIFY in a different style. The problem is that if you are trying to use NOTIFY as a queue, you will soon realize that it has the wrong semantics for that --- in particular, losing notifies across a system crash or client crash is OK for a condition notification, not so OK for a message queue. The difference is that the former style assumes that the authoritative data is in a table somewhere, so you can still find out what you need to know after reconnecting. If you are doing messaging you are likely to think that you don't need any backing store for the system state.

So while a payload string for NOTIFY has been on the to-do list since forever, I have to think that Greg's got a good point questioning whether it is actually a good idea.

			regards, tom lane
On Thu, Nov 12, 2009 at 4:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So while a payload string for NOTIFY has been on the to-do list since
> forever, I have to think that Greg's got a good point questioning
> whether it is actually a good idea.

Here's an example of why I'd like a payload (and not a queue in an add-on module).

Say you have multiple users running pgAdmin. One of them creates a new table. Currently, unless the other user refreshes his view of the database, he won't see it. I'd like to be able to notify the other pgAdmin sessions that a new table has been created, so they can automatically display it, without having to poll pg_class or be manually refreshed. The payload could contain details of the type of object that has been created, and its OID/identifier, to minimise the work required of the other sessions to find and display the new object.

And as I'm sure you're already thinking it, yes, I know it doesn't help if the new table is created using psql, but there are lots of shops where pgAdmin is the default tool, and it could help them and just exhibit the current behaviour if someone does break out psql.

--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com
On Thu, Nov 12, 2009 at 11:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joachim Wieland <joe@mcknight.de> writes:
>> However I share Greg's concerns that people are trying to use NOTIFY
>> as a message queue, which it is not designed to be.
>
> Yes. Particularly those complaining that they want to have very large
> payload strings --- that's pretty much a dead giveaway that it's not
> being used as a condition signal.
>
> Now you might say that yeah, that's the point, we're trying to enable
> using NOTIFY in a different style. The problem is that if you are
> trying to use NOTIFY as a queue, you will soon realize that it has
> the wrong semantics for that --- in particular, losing notifies across
> a system crash or client crash is OK for a condition notification,
> not so OK for a message queue. The difference is that the former style
> assumes that the authoritative data is in a table somewhere, so you can
> still find out what you need to know after reconnecting. If you are
> doing messaging you are likely to think that you don't need any backing
> store for the system state.
>
> So while a payload string for NOTIFY has been on the to-do list since
> forever, I have to think that Greg's got a good point questioning
> whether it is actually a good idea.

I think there could be cases where the person writing the code can know, extrinsic to the system, that lost notifications are OK, and still want to deliver a payload. But I think the idea of enabling a huge payload is not wise, as it sounds like it will sacrifice performance for a feature that is by definition not essential to anyone who is using this now. A small payload seems like a reasonable compromise.

...Robert
On Thu, Nov 12, 2009 at 11:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joachim Wieland <joe@mcknight.de> writes:
>> However I share Greg's concerns that people are trying to use NOTIFY
>> as a message queue, which it is not designed to be.
>
> Yes. Particularly those complaining that they want to have very large
> payload strings --- that's pretty much a dead giveaway that it's not
> being used as a condition signal.
>
> Now you might say that yeah, that's the point, we're trying to enable
> using NOTIFY in a different style. The problem is that if you are
> trying to use NOTIFY as a queue, you will soon realize that it has
> the wrong semantics for that --- in particular, losing notifies across
> a system crash or client crash is OK for a condition notification,
> not so OK for a message queue. The difference is that the former style
> assumes that the authoritative data is in a table somewhere, so you can
> still find out what you need to know after reconnecting. If you are
> doing messaging you are likely to think that you don't need any backing
> store for the system state.
>
> So while a payload string for NOTIFY has been on the to-do list since
> forever, I have to think that Greg's got a good point questioning
> whether it is actually a good idea.

You guys are assuming it's being used as a queue, which is only a fraction of the cases where this feature is useful. In fact, having a payload can remove the need for a queue completely where one is currently required for no other reason than to deliver payload messages.

I'm sorry, the 128 character limit is simply lame (other than for unsolvable implementation/performance complexity, which I doubt is the case here), and if that constraint is put in by the implementation, then the implementation is busted and should be reworked until it's right. A feature that is being used for things not intended is a sign of a strong feature, not a weak one, and the idea that a payload should be length-limited in order to prevent use in ways that are 'wrong' is a very weak argument IMO. People have been asking for this feature since the beginning of time, and nobody said: 'please limit it to 128 bytes'. A limit of 4k - 64k is much more appropriate, if you even want a hard limit at all...

merlin
On Thu, Nov 12, 2009 at 11:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 12, 2009 at 11:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Joachim Wieland <joe@mcknight.de> writes:
>>> However I share Greg's concerns that people are trying to use NOTIFY
>>> as a message queue, which it is not designed to be.
>>
>> Yes. Particularly those complaining that they want to have very large
>> payload strings --- that's pretty much a dead giveaway that it's not
>> being used as a condition signal.
>>
>> Now you might say that yeah, that's the point, we're trying to enable
>> using NOTIFY in a different style. The problem is that if you are
>> trying to use NOTIFY as a queue, you will soon realize that it has
>> the wrong semantics for that --- in particular, losing notifies across
>> a system crash or client crash is OK for a condition notification,
>> not so OK for a message queue. The difference is that the former style
>> assumes that the authoritative data is in a table somewhere, so you can
>> still find out what you need to know after reconnecting. If you are
>> doing messaging you are likely to think that you don't need any backing
>> store for the system state.
>>
>> So while a payload string for NOTIFY has been on the to-do list since
>> forever, I have to think that Greg's got a good point questioning
>> whether it is actually a good idea.
>
> I think there could be cases where the person writing the code can
> know, extrinsic to the system, that lost notifications are OK, and
> still want to deliver a payload. But I think the idea of enabling a
> huge payload is not wise, as it sounds like it will sacrifice
> performance for a feature that is by definition not essential to

'premature optimization is the root of all evil' :-)

merlin
> Now you might say that yeah, that's the point, we're trying to enable
> using NOTIFY in a different style. The problem is that if you are
> trying to use NOTIFY as a queue, you will soon realize that it has
> the wrong semantics for that --- in particular, losing notifies across
> a system crash or client crash is OK for a condition notification,
> not so OK for a message queue. The difference is that the former style
> assumes that the authoritative data is in a table somewhere, so you can
> still find out what you need to know after reconnecting. If you are
> doing messaging you are likely to think that you don't need any backing
> store for the system state.

I simply don't agree that the semantics have to change. You call it a queue, I call it session data. There is no reason why the documentation can't state that notifies may not be delivered due to crashes, so make sure to use persistent storage for any payload worth keeping post-session.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
On Thu, Nov 12, 2009 at 11:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I'm sorry, the 128 character limit is simply lame (other than for
> unsolvable implementation/performance complexity, which I doubt is the
> case here), and if that constraint is put in by the implementation,
> then the implementation is busted and should be reworked until it's
> right.

After some reflection, I realized this was an overly strong statement and impolite to the OP. It's easy to yarp from the gallery with the other peanuts :-). It's not the implementation I have an issue with, just the _idea_ that we should be restricted to small payloads for religious reasons...before that came up, I was already scheming on how to both extend the patch to be more flexible in terms of payload size, and to backpatch and test it on 8.4 (no point if the community has no interest however). In any event, sorry for the strong words.

merlin
Tom Lane writes:
> Yes. Particularly those complaining that they want to have very large
> payload strings --- that's pretty much a dead giveaway that it's not
> being used as a condition signal.

I don't want "large", but I do think an arbitrary limit of 128 is odd without some justification. I could do 128, but would probably be happier with a bit more room.

> Now you might say that yeah, that's the point, we're trying to enable
> using NOTIFY in a different style. The problem is that if you are
> trying to use NOTIFY as a queue, you will soon realize that it has
> the wrong semantics for that --- in particular, losing notifies across
> a system crash or client crash is OK for a condition notification,
> not so OK for a message queue. The difference is that the former style
> assumes that the authoritative data is in a table somewhere, so you can
> still find out what you need to know after reconnecting. If you are
> doing messaging you are likely to think that you don't need any backing
> store for the system state.

That's putting an awful lot of assumptions on what people are going to do in the future, and also not being imaginative enough about circumstances in which a payload which is not system-crash survivable is a completely acceptable condition. In most of my use cases, even desired. People wanting a real queue can continue to use something other than NOTIFY.

> So while a payload string for NOTIFY has been on the to-do list since
> forever, I have to think that Greg's got a good point questioning
> whether it is actually a good idea.

Absolutely is a good idea.

Agent M asks:
> The notification count could be a secondary "payload" which does not
> affect the first, but I guess I'm the only one complaining about the
> coalescing...

Er...this thread is only a few hours old. For the record, I'm fine with the coalescing as we do now (at least as far as my own selfish purposes :)

--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200911121836
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
On 11/12/09 8:30 AM, Tom Lane wrote:
> So while a payload string for NOTIFY has been on the to-do list since
> forever, I have to think that Greg's got a good point questioning
> whether it is actually a good idea.

Sure, people will abuse it as a queue. But people abuse arrays when they should be using child tables, use composite types to make data non-atomic, and use dblink when they really should be using schema. Does the potential for misuse mean that we should drop the features? No.

Payloads are also quite useful even in a lossy environment, where you understand that LISTEN is not a queue. For example, I'd like to be using LISTEN/NOTIFY for cache invalidation for some applications; if it misses a few, or double-counts them, it's not an issue. However, I'd like to be able to send a message like players_updated|45321, where 45321 is the ID of the player updated.

--Josh Berkus
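For that style of use, the receiving side needs only a few lines to split such a payload (a sketch; the "name|id" format is just the example above, and the action taken is application-defined):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Split a payload of the form "players_updated|45321" into a cache
     * name and a row id. */
    static void
    handle_invalidation(const char *payload)
    {
        const char *sep = strchr(payload, '|');

        if (sep != NULL)
        {
            long id = strtol(sep + 1, NULL, 10);

            printf("invalidate \"%.*s\", id %ld\n",
                   (int) (sep - payload), payload, id);   /* stand-in action */
        }
    }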
On Thu, Nov 12, 2009 at 8:44 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 11/12/09 8:30 AM, Tom Lane wrote:
>> So while a payload string for NOTIFY has been on the to-do list since
>> forever, I have to think that Greg's got a good point questioning
>> whether it is actually a good idea.
>
> Sure, people will abuse it as a queue. But people abuse arrays when
> they should be using child tables, use composite types to make data
> non-atomic, and use dblink when they really should be using schema.
> Does the potential for misuse mean that we should drop the features? No.

I agree. We frequently reject features on the basis that someone might do something stupid with them. It's lame and counterproductive, and we should stop. The world contains infinite amounts of lameness, but that's the world's problem, not ours. There is zero evidence that this feature is only useful for stupid purposes, and some evidence (namely, the opinions of esteemed community members) that it is useful for at least some non-stupid purposes.

...Robert
> and we should stop. The world contains infinite amounts of lameness,
> but that's the world's problem, not ours. There is zero evidence that

+1

> this feature is only useful for stupid purposes, and some evidence
> (namely, the opinions of esteemed community members) that it is useful
> for at least some non-stupid purposes.

The unexpected application of a feature can be creative or innovative, which I firmly believe is something this community embraces. How many ways can a screwdriver be used ... think MacGyver :)

Determining whether it's creative vs. stupid would require an understanding of the context in which it was applied. For example, using our screwdriver to remove a splinter would be rather stupid, IMHO ;)

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
On Nov 12, 2009, at 5:57 PM, Robert Haas wrote:
> On Thu, Nov 12, 2009 at 8:44 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 11/12/09 8:30 AM, Tom Lane wrote:
>>> So while a payload string for NOTIFY has been on the to-do list since
>>> forever, I have to think that Greg's got a good point questioning
>>> whether it is actually a good idea.
>>
>> Sure, people will abuse it as a queue. But people abuse arrays when
>> they should be using child tables, use composite types to make data
>> non-atomic, and use dblink when they really should be using schema.
>> Does the potential for misuse mean that we should drop the features? No.
>
> I agree. We frequently reject features on the basis that someone
> might do something stupid with them. It's lame and counterproductive,
> and we should stop. The world contains infinite amounts of lameness,
> but that's the world's problem, not ours. There is zero evidence that
> this feature is only useful for stupid purposes, and some evidence
> (namely, the opinions of esteemed community members) that it is useful
> for at least some non-stupid purposes.

Speaking as a consumer of this feature, my (mild) concern is not that by adding functionality it will make it possible to do new things badly, it's that it might make it harder (or impossible) to do currently supported things as well. I like the current notify. It's a good match for handling table-based queues or for app-level cache invalidation. Any changes that make it less good at that would be a step backwards. The more complex and flexible and "heavier" the changes, the more that concerns me. (Though a small payload - I'd settle for at least an integer - would be convenient, to allow invalidation of 20 different caches without needing 20 different LISTENs.)

Cheers,
Steve
Unfortunately, with all that payload-length discussion, the other part of my email regarding ACID-compliant behavior got completely lost. I would appreciate some input on that part also...

Thanks,
Joachim

On Thu, Nov 12, 2009 at 12:17 PM, Joachim Wieland <joe@mcknight.de> wrote:
> On Thu, Nov 12, 2009 at 4:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> One possible solution would be to write to the queue before committing,
>>> adding the TransactionID. Then other backends can check whether our
>>> TransactionID has committed successfully or not. Not sure if this is
>>> worth the overhead however...
>>
>> That sounds reasonable, and it's certainly no more overhead than the
>> tuple visibility checks that happen in the current implementation.
>
> I am not too concerned about the runtime of the visibility checks;
> instead I suppose that most of the overhead will come from waiting for
> another transaction to either commit or abort...
>
> If transaction t1 scans the queue and at some point finds
> notifications from t2, it will ask for the status of t2. If it finds
> out that t2 is still running, it has no option other than to stop
> working on the queue and wait (it will be signalled again later once
> t2 has finished).
>
> This also means that we cannot write notifications to the queue and
> read from it at the same time, and if a transaction places a few
> million notifications into the queue, readers need to wait until it
> has finished, and only after that can they continue and read the
> notifications...
>
> And it means that if the queue is full, we might run into a
> deadlock... A transaction adding notifications will wait for the
> readers to proceed and the readers wait for the transaction to commit
> or abort...
>
> One option could be to write new notifications to a subdirectory and
> create a bunch of new segment files there. Once this is done, the
> segment files could be moved over and renamed so that they continue
> the slru queue... If we run out of disk space while filling that
> temporary subdirectory, then we can just delete the subdirectory and
> nobody has been blocked. We could still run into errors moving and
> renaming the segment files (e.g. permission problems), so we still
> might need to abort the transaction...
>
>>> 2. The payload parameter is optional. A notifying client can either
>>> call "NOTIFY foo;" or "NOTIFY foo 'payload';". The length of the
>>> payload is currently limited to 128 characters... Not sure if we
>>> should allow longer payload strings...
>>
>> Might be a good idea to make the max the same as the max length for
>> prepared transaction GUIDs? Not sure anyone would be shipping those
>> around, but it's a pre-existing limit of about the same size.
>
> Yes, sounds reasonable to have the same limit for user-defined
> identifiers...
>
> Joachim
On Fri, Nov 13, 2009 at 1:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree. We frequently reject features on the basis that someone
> might do something stupid with them. It's lame and counterproductive,
> and we should stop. The world contains infinite amounts of lameness,
> but that's the world's problem, not ours. There is zero evidence that
> this feature is only useful for stupid purposes, and some evidence
> (namely, the opinions of esteemed community members) that it is useful
> for at least some non-stupid purposes.

This is BS. The problem is not that someone might do something stupid with this feature. The problem is that we're making these other use cases into requirements which will influence the design. This is a classic "feature creep" situation, and the result is normally products which solve none of the use cases especially well.

Remember this queue has to live in shared memory, which is a fixed-size resource.

If you're designing a queue mechanism then you would naturally use something like a queue or priority queue. You expect to spill to disk and need an efficient storage mechanism. The natural implementation of this in Postgres would be a table, not the slru.

If you're designing a condition-variable mechanism then you would naturally use a hash table which can probably live in a single page, with a simple flag for each variable.

The comment in another thread that this mechanism should implement ACID properties just reinforces my reaction.

--
greg
On Fri, Nov 13, 2009 at 5:35 AM, Greg Stark <gsstark@mit.edu> wrote:
> This is BS. The problem is not that someone might do something stupid
> with this feature. The problem is that we're making these other use
> cases into requirements which will influence the design. This is a
> classic "feature creep" situation, and the result is normally products
> which solve none of the use cases especially well.

Removing a length restriction is feature creep? Having a flexible payload mechanism improves on notify in the same way that epoll improves on poll. Yes, epoll is overdesigned, highly overused, etc., but it does vastly improve server handling/responsiveness in some situations. Delivering a notification with data saves a round trip back to the server and a transaction, which is helpful both in terms of server load and in improving latency. On top of that, I don't think saying "hello; here's some data" is groundbreaking in terms of network communication paradigms.

My interest in this feature is not academic; the project I'm working on could use it with great benefit immediately. Arguments that I should use notify only for the set list of use cases envisioned by the original authors are not going to hold much water with me :-). IMNSHO, arguing that keeping payloads limited to a tiny size 'improves' this feature is not a winnable argument.

That said, I do appreciate simple designs and very much understand trying to keep things simple. So let me ask you this:

*) Are you sure that putting a variable length payload into the slru is going to complicate things that badly in terms of implementing this feature? If so, how?

*) Wouldn't you agree that variable length would actually benefit 'proper' (small) payloads by allowing more of them to fit in an slru page?

*) 8k should be enough for anybody :-) ...so if a variable length structure can be made, why not max the payload length at blcksz-hdrsz and call it a day (yes, I am aware that extending the structure will reduce the payload's maximum length)? I think this should fit quite nicely into the OP's approach, and it benefits both people who use small payloads and those who use large ones... (I DO think spanning pages is complex and probably unnecessary.)

merlin
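A variable-length entry of the kind proposed here might look roughly like this (a sketch with hypothetical names): the entry stores its own total length first so a reader can step through a page, and the payload is bounded by whatever is left of a BLCKSZ page after the header:

    #include "postgres.h"   /* int16, Oid, NAMEDATALEN, MAXALIGN */

    /*
     * Hypothetical variable-length queue entry.  "payload[1]" is the
     * usual variable-length-struct convention; the entry is allocated at
     * offsetof(AsyncQueueEntry, payload) + payload size, up to one page.
     */
    typedef struct AsyncQueueEntry
    {
        int16   length;                 /* total entry size, incl. header */
        Oid     dboid;                  /* originating database */
        char    channel[NAMEDATALEN];   /* channel name */
        char    payload[1];             /* actually variable length */
    } AsyncQueueEntry;

    /* Step to the next entry within a page (sketch). */
    #define QUEUE_NEXT_ENTRY(qe) \
        ((AsyncQueueEntry *) ((char *) (qe) + MAXALIGN((qe)->length)))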
Joachim Wieland wrote:
> 1. Instead of placing the queue into shared memory only, I propose to
> create a new subdirectory pg_notify/ and make the queue slru-based, so
> that we do not risk blocking. Several people here have pointed out that
> blocking is a true no-go for a new listen/notify implementation. With
> an slru-based queue we have so much space that blocking is really
> unlikely, even in periods with extreme notify bursts.
>
> Regarding performance, the slru queue is not fsync-ed to disk, so most
> activity would happen in the OS file cache anyway, and most backends
> will probably work on the same pages most of the time. However, more
> locking overhead is required in comparison to a shared-memory-only
> implementation.

Note, however, that pg_subtrans is also supposed to use "the same pages from memory most of the time" and it still is a performance bottleneck in some cases, and enlarging NUM_SUBTRANS_BUFFERS is a huge boon. I think holding AsyncCtlLock in exclusive mode for every notify add (which may involve I/O) is going to be a hard hit on scalability.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
> spill to disk and need an efficient storage mechanism. The natural
> implementation of this in Postgres would be a table, not the slru.

This is what I think people's real problem is: the implementation becomes more complex when including payloads (larger ones even more so). I think it's a side-track to discuss queue vs condition variables. Whether a notify is 20 bytes through the network or 8192 bytes doesn't change its design or purpose, only its size.

Calling this a creeping feature is quite a leap.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
On Fri, Nov 13, 2009 at 1:47 PM, Andrew Chernow <ac@esilo.com> wrote:
> This is what I think people's real problem is: the implementation
> becomes more complex when including payloads (larger ones even more so).
> I think it's a side-track to discuss queue vs condition variables.
> Whether a notify is 20 bytes through the network or 8192 bytes doesn't
> change its design or purpose, only its size.
>
> Calling this a creeping feature is quite a leap.

It's true that the real creep is having the payload at all rather than not having it. But having fixed-size data also makes the implementation much simpler as well.

One person described stuffing the payload with the primary key of the record being invalidated. This means the requirements have just gone from holding at most some small fixed number of records, bounded by the number of tables or other shared data structures, to holding a large number of records bounded only by the number of records in their tables, which is usually much much larger.

Now you're talking about making the payloads variable size, which means you need to do free space management within shared pages to keep track of how much space is free and available for reuse.

So we've gone from a simple hash table of fixed-size entries containing an oid or "name" datum, where we expect the hash table to fit in memory and a simple lru can handle old pages that aren't part of the working set, to something that's going to look a lot like a database table -- it has to handle reusing space in collections of variable-size data and scale up to millions of entries. And I note someone else in the thread was suggesting it needed ACID properties, which makes space reuse even more complex and will need something like vacuum to implement it.

--
greg
>> Calling this a creeping feature is quite a leap.
>
> It's true that the real creep is having the payload at all rather than
> not having it.

Not having the payload at all is like Santa showing up without his bag of
toys. Instead, you have to drive/fly to the north pole where he just came
from to get them.

> One person described stuffing the payload with the primary key of the
> record being invalidated. [...]
>
> So we've gone from a simple hash table of fixed-size entries [...] to
> something that's going to look a lot like a database table: it has to
> handle reusing space in collections of variable-size data and scale up to
> millions of entries.
>
> And I note someone else in the thread was suggesting it needed ACID
> properties, which makes space reuse even more complex and will need
> something like vacuum to implement it.

I think the OP's original approach was close. The structure can still be
fixed length, but maybe we can bump it to 8k (BLCKSZ)?

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
On Fri, Nov 13, 2009 at 10:00 AM, Andrew Chernow <ac@esilo.com> wrote:
> I think the OP's original approach was close. The structure can still be
> fixed length, but maybe we can bump it to 8k (BLCKSZ)?

The problem with this (which I basically agree with) is that this will
greatly increase the size of the queue for all participants of this
feature, whether they use the payload or not. I think it boils down to
this: is there a reasonably effective way of making the payload variable
length (now or in the future)? If not, let's compromise and maybe go with a
larger size, maybe 256 or 512 bytes.

merlin
Merlin Moncure <mmoncure@gmail.com> writes:
> The problem with this (which I basically agree with) is that this will
> greatly increase the size of the queue for all participants of this
> feature, whether they use the payload or not. I think it boils down to
> this: is there a reasonably effective way of making the payload variable
> length (now or in the future)? If not, let's compromise and maybe go with
> a larger size, maybe 256 or 512 bytes.

Yeah, if the payload is not variable length then we are not going to be
able to make it more than a couple hundred bytes without taking a
significant performance hit. (By the way, has anyone yet tried to compare
the speed of this implementation to the old code?)

			regards, tom lane
Merlin Moncure wrote:
> On Fri, Nov 13, 2009 at 10:00 AM, Andrew Chernow <ac@esilo.com> wrote:
>> I think the OP's original approach was close. The structure can still be
>> fixed length, but maybe we can bump it to 8k (BLCKSZ)?
>
> The problem with this (which I basically agree with) is that this will
> greatly increase the size of the queue for all participants of this
> feature, whether they use the payload or not. I think it boils down to
> this: is there a reasonably effective way of making the payload variable
> length (now or in the future)? If not, let's compromise and maybe go with
> a larger size, maybe 256 or 512 bytes.

My original intention was to have the queue as a circular buffer where the
size of the entries was variable, but possibly bounded. Certainly using
fixed-length records of large size seems somewhat wasteful. But maybe that
doesn't fit with what Joachim has done.

Incidentally, I'd like to thank Joachim personally for having done this
work, which I have been trying to get to for a couple of years, and which
circumstances kept conspiring to prevent me from doing. It's been a big
monkey on my back.

cheers

andrew
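A minimal sketch of that kind of circular buffer -- variable,
length-prefixed entries with a hard bound -- under invented names and sizes
(and ignoring locking and the wraparound of the copy itself):

    #include <string.h>
    #include <stddef.h>

    #define QUEUE_SIZE  (64 * 1024)
    #define MAX_ENTRY   512            /* "variable, but possibly bounded" */

    static char   queue[QUEUE_SIZE];
    static size_t head, tail;          /* write and read offsets */

    /* Returns 0 on success, -1 if the entry is too big or the ring is full. */
    static int
    queue_put(const char *data, unsigned short len)
    {
        size_t need = sizeof(len) + len;

        if (len > MAX_ENTRY)
            return -1;
        if ((tail + QUEUE_SIZE - head - 1) % QUEUE_SIZE < need)
            return -1;                 /* full: caller blocks, errors, or drops */
        memcpy(queue + head, &len, sizeof(len));
        memcpy(queue + head + sizeof(len), data, len);
        head = (head + need) % QUEUE_SIZE;
        return 0;
    }

Readers advance tail the same way, so a two-byte header per entry is the
entire overhead for an empty payload, versus a full fixed-size slot.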
> This is BS. The problem is not that someone might do something stupid with
> this feature. The problem is that we're making these other use cases into
> requirements which will influence the design. This is a classic "feature
> creep" situation and the result is normally products which solve none of
> the use cases especially well.

Feature creep? The payload has been on the roadmap for a long time. I don't
recall anyone objecting when Andrew was working on the next version of
listen/notify, probably a couple of years ago now.

> Remember this queue has to live in shared memory which is a fixed size
> resource. If you're designing a queue mechanism then you would naturally
> use something like a queue or priority queue.

Right, but we're not discussing a queue, we're discussing the listen/notify
system. If people want to mis-use it as a queue when they should be using
something else, so be it. Talk of efficiency also seems silly here -- using
shared memory is already way more efficient than our current listen/notify
system.

--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8
"Greg Sabino Mullane" <greg@turnstep.com> writes: > Talk of efficiency also seems silly here - using > shared memory is already way more efficient than our current listen/notify > system. Except that the proposed implementation spills to disk. Particularly if it has to have support for large payloads, it could very well end up being a lot SLOWER than what we have now. regards, tom lane
Tom Lane wrote:
> "Greg Sabino Mullane" <greg@turnstep.com> writes:
>> Talk of efficiency also seems silly here -- using shared memory is
>> already way more efficient than our current listen/notify system.
>
> Except that the proposed implementation spills to disk. Particularly if
> it has to have support for large payloads, it could very well end up
> being a lot SLOWER than what we have now.

True, but do you really consider it a common case that the notify system
gets so bogged down that it starts to crawl? The problem would be the
collective size of notify structures plus payloads, and whether that fits
in memory or not. This leads me to believe that the only safety in smaller
payloads is *possibly* a smaller chance of bogging it down, but that all
depends on the usage pattern of smaller vs. larger payloads, which is
system-specific.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
> My original intention was to have the queue as a circular buffer where
> the size of the entries was variable, but possibly bounded. Certainly
> using fixed-length records of large size seems somewhat wasteful.

Maybe we should do away with 'spill to disk' altogether and either
hard-code an overflow behavior or make it a knob. Possible overflow
behaviors could be: block until space is available, throw an error, or
silently drop it (see the sketch below). Can the size of the shared memory
segment for notifications be made configurable? That would allow those with
large payloads or a huge number of notifications to bump memory to avoid
overflow cases.

By removing the disk and making shmem usage configurable, I think the
notify system would be flexible and could scale nicely. Another added
benefit is that the payload limit can be much higher than previously
considered, because memory/performance concerns are now in the hands of
the DBA.

> Incidentally, I'd like to thank Joachim personally for having done this
> work,

+1

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
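In code, the knob might look like this; the GUC name and enum are entirely
hypothetical, and queue_put refers to the ring-buffer sketch above:

    /* Possible values for a hypothetical notify_overflow_policy GUC. */
    typedef enum NotifyOverflowPolicy
    {
        NOTIFY_OVERFLOW_BLOCK,  /* wait until listeners drain the queue */
        NOTIFY_OVERFLOW_ERROR,  /* fail the NOTIFY with an error */
        NOTIFY_OVERFLOW_DROP    /* discard silently; listeners tolerate loss */
    } NotifyOverflowPolicy;

    /* When queue_put() reports the ring is full: */
    switch (notify_overflow_policy)
    {
        case NOTIFY_OVERFLOW_BLOCK:
            /* sleep, then retry once a reader has advanced its tail */
            break;
        case NOTIFY_OVERFLOW_ERROR:
            /* ereport(ERROR, ...): "notification queue is full" */
            break;
        case NOTIFY_OVERFLOW_DROP:
            return;             /* the lossy, cache-invalidation-style choice */
    }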
Josh Berkus wrote:
> Payloads are also quite useful even in a lossy environment, where you
> understand that LISTEN is not a queue. For example, I'd like to be using
> LISTEN/NOTIFY for cache invalidation for some applications; if it misses
> a few, or double-counts them, it's not an issue.

Not sure how missing a few can fail to be an issue for cache invalidation,
but never mind.

In the use-cases I've considered, I've used a trigger to write a change
notification to a table, and a notify to indicate that notification
record(s) have changed. The notifications contain copies of the primary
keys and the action, so the cache processor knows what's changed, and the
notify is a wakeup signal.

If this is in fact the most common use case, perhaps an alternative
approach would be to automate it directly, so that writing the triggers
(and using the trigger processing engine) would be unnecessary:

- the queue definition can be automated with reference to the parent table,
  by DDL stating that one is required
- the notification 'name' is effectively the queue name, and the
  subscription says 'tell me when a change note is placed in the queue'

Doing this in the database engine core allows a number of potential
optimisations:

- the mechanism does not require general trigger execution
- the queue does not have to be a real table, and can have custom
  semantics: it may not actually be necessary to store copies of the
  primary key data if it can refer to the rows, so the data can be
  retrieved as the queue is consumed
- if there are no subscribers to the queue then the insertion to it can be
  elided
- if the server crashes, connected consumers should assume caches are
  invalid; there is no ACID requirement for the queue data
- if the queue fills then slow consumer(s) can be dropped and can receive a
  data loss indicator

> However, I'd like to be able to send a message like players_updated|45321,
> where 45321 is the ID of the player updated.

Indeed. Just a thought, anyway. (FWIW I was initially concerned about the
lack of payload, but with any sort of lossy compression I figured it wasn't
really an issue, and that what I needed was a notification queue.)

James
On Fri, Nov 13, 2009 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (By the way, has anyone yet tried to compare the speed of this
> implementation to the old code?)

I quickly hacked pgbench to take a custom script on connection (for the
LISTEN), and made pgbench run 'NOTIFY x;' with all clients doing
'LISTEN x;'.

The old method (measured on a 4-core high-performance server) has severe
scaling issues due to table bloat (we knew that):

./pgbench -c 10 -t 1000 -n -b listen.sql -f notify.sql
run #1 tps = 1364.948079 (including connections establishing)
run #2 tps = 573.988495 (including connections establishing)

<vac full pg_listener>

./pgbench -c 50 -t 200 -n -b listen.sql -f notify.sql
tps = 844.033498 (including connections establishing)

New method on my dual-core workstation (max payload 128):

./pgbench -c 10 -t 10000 -n -b listen.sql -f notify.sql -hlocalhost postgres
tps = 16343.012373 (including connections establishing)
./pgbench -c 20 -t 5000 -n -b listen.sql -f notify.sql -hlocalhost postgres
tps = 7642.104941 (including connections establishing)
./pgbench -c 50 -t 5000 -n -b listen.sql -f notify.sql -hlocalhost postgres
tps = 3184.049268 (including connections establishing)

Max payload 2048:

./pgbench -c 10 -t 10000 -n -b listen.sql -f notify.sql -hlocalhost postgres
tps = 12062.906610 (including connections establishing)
./pgbench -c 20 -t 5000 -n -b listen.sql -f notify.sql -hlocalhost postgres
tps = 7229.505869 (including connections establishing)
./pgbench -c 50 -t 5000 -n -b listen.sql -f notify.sql -hlocalhost postgres
tps = 3219.511372 (including connections establishing)

I'm getting sporadic 'LOG: could not send data to client: Broken pipe'
messages throughout the test.

merlin
On Thu, 12 Nov 2009 11:22:32 -0500 Andrew Chernow <ac@esilo.com> wrote:
> > However I share Greg's concerns that people are trying to use NOTIFY as
> > a message queue, which it is not designed to be.
>
> When you have an established libpq connection waiting for notifies, it is
> not unreasonable to expect/desire a payload. ISTM the problem is that the
> initial design was half-baked. NOTIFY is event-driven, i.e. no polling!

I agree. Wouldn't it make sense to allow the user to pass libpq a callback
function which is executed when NOTIFY events happen? Currently we are
forced to poll the connection, which means that we'll be checking for a
NOTIFY every time we have new data. That just doesn't make sense.

--
Alex
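libpq has no callback registration for this today, but you can avoid busy
polling by sleeping on the connection's socket. A minimal sketch using only
existing libpq calls (handle_notify is a hypothetical user-supplied
callback):

    #include <sys/select.h>
    #include <libpq-fe.h>

    static void
    wait_for_notifies(PGconn *conn, void (*handle_notify)(PGnotify *))
    {
        int sock = PQsocket(conn);

        for (;;)
        {
            fd_set    input_mask;
            PGnotify *notify;

            FD_ZERO(&input_mask);
            FD_SET(sock, &input_mask);
            /* Sleep until the server sends something -- no polling. */
            if (select(sock + 1, &input_mask, NULL, NULL, NULL) < 0)
                break;              /* error or signal; caller decides */

            PQconsumeInput(conn);
            while ((notify = PQnotifies(conn)) != NULL)
            {
                /* channel is notify->relname; a payload would arrive in
                 * notify->extra */
                handle_notify(notify);
                PQfreemem(notify);
            }
        }
    }

A true callback API would essentially be this loop moved inside libpq.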
On Wed, 2009-11-11 at 22:25 +0100, Joachim Wieland wrote:
> 3. Every distinct notification is delivered.
> Regarding performance, the slru-queue is not fsync-ed to disk

These two statements seem to be in opposition. How do you know a
notification will be delivered if the queue is non-recoverable? Surely the
idea is to send information externally to the database, so why should that
stream of info be altered depending upon whether the database crashes? You
couldn't use it to reliably update an external cache, for example.

Why do we need this as well as PgQ?

For me, I would need a good reason why this shouldn't be implemented using
a normal table, plus bells and whistles. If normal tables don't do what you
need, perhaps that's a place to add value.

--
Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Wed, 2009-11-11 at 22:25 +0100, Joachim Wieland wrote:
>> 3. Every distinct notification is delivered.
>> Regarding performance, the slru-queue is not fsync-ed to disk
>
> These two statements seem to be in opposition. How do you know a
> notification will be delivered if the queue is non-recoverable?

You misunderstand the requirements. LISTEN notifications are *not* meant to
survive a database crash, and never have been. However, so long as both
client and server stay up, they must be reliable. If the client has to poll
database state because it might have missed a notification, the feature is
just a waste of time.

			regards, tom lane
On Sun, 2009-11-15 at 16:48 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Wed, 2009-11-11 at 22:25 +0100, Joachim Wieland wrote:
> >> 3. Every distinct notification is delivered.
> >> Regarding performance, the slru-queue is not fsync-ed to disk
> >
> > These two statements seem to be in opposition. How do you know a
> > notification will be delivered if the queue is non-recoverable?
>
> You misunderstand the requirements. LISTEN notifications are *not* meant
> to survive a database crash, and never have been. However, so long as
> both client and server stay up, they must be reliable. If the client has
> to poll database state because it might have missed a notification, the
> feature is just a waste of time.

Why would it be so important for messages to be reliable if the database is
up, yet it's OK to lose messages if it crashes? The application must still
allow for the case that messages are lost.

--
Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sun, 2009-11-15 at 16:48 -0500, Tom Lane wrote:
>> You misunderstand the requirements. LISTEN notifications are *not* meant
>> to survive a database crash, and never have been. However, so long as
>> both client and server stay up, they must be reliable. If the client has
>> to poll database state because it might have missed a notification, the
>> feature is just a waste of time.
>
> Why would it be so important for messages to be reliable if the database
> is up, yet it's OK to lose messages if it crashes? The application must
> still allow for the case that messages are lost.

No, that's the point. The design center for LISTEN is that you have a
client that needs to respond to changes in the DB state. When it first
connects it will issue LISTEN and then (order is important) it will examine
the current state of the database. After that it can just wait for NOTIFY
to tell it that something interesting has happened. If it crashes, or sees
a disconnect indicating that the server has crashed, it goes back to the
startup point. No problem. But if it can't be sure that it will get a
NOTIFY every time something happens to the DB state, then it has to do
active polling of the state instead, and the NOTIFY feature is really
worthless to it.

This is an entirely useful and reliable feature within these parameters ---
the first application I ever wrote using PG relied on NOTIFY to work this
way. (In fact it wouldn't be overstating the case to say that I wouldn't be
a PG hacker today if it weren't for LISTEN/NOTIFY.)

			regards, tom lane
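In code, that startup discipline is just a few lines; the table and channel
names here are invented for the example:

    PGresult *res;

    /* Order is important: LISTEN *before* reading the current state, so
     * any change made after our snapshot still produces a NOTIFY for us. */
    res = PQexec(conn, "LISTEN players_updated");
    PQclear(res);

    res = PQexec(conn, "SELECT id, score FROM players");
    /* ... prime the local cache from the result ... */
    PQclear(res);

    /* From here on, block in the select() loop sketched earlier and
     * refresh only what changed.  On any disconnect, reconnect and rerun
     * this same startup path -- which is why crash-surviving
     * notifications are unnecessary. */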
>> You misunderstand the requirements. LISTEN notifications are *not* meant
>> to survive a database crash, and never have been. However, so long as
>> both client and server stay up, they must be reliable. If the client has
>> to poll database state because it might have missed a notification, the
>> feature is just a waste of time.
>
> Why would it be so important for messages to be reliable if the database
> is up, yet it's OK to lose messages if it crashes? The application must
> still allow for the case that messages are lost.

Well, there are many use cases. For example, Bucardo uses notifications to
let it know that a table has changed. If the database crashes, Bucardo is
going to restart; as part of its startup routine, it checks all tables
manually for changes, eliminating the need for the NOTIFYs to survive the
crash.

--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation
PGP Key: 0x14964AC8
On Sat, Nov 14, 2009 at 11:06 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> The old method (measured on a 4-core high-performance server) has severe
> scaling issues due to table bloat (we knew that):
> ./pgbench -c 10 -t 1000 -n -b listen.sql -f notify.sql
> run #1 tps = 1364.948079 (including connections establishing)
>
> New method on my dual-core workstation (max payload 128):
> ./pgbench -c 10 -t 10000 -n -b listen.sql -f notify.sql -hlocalhost postgres
> tps = 16343.012373 (including connections establishing)

That looks fine and is similar to my tests, where I also see a performance
increase of about 10x, and unlike pg_listener it is constant.

> I'm getting sporadic 'LOG: could not send data to client: Broken pipe'
> messages throughout the test.

This looks like the server is trying to send a notification down to the
client but the client has already terminated the connection...

Joachim
On Mon, Nov 16, 2009 at 4:41 PM, Joachim Wieland <joe@mcknight.de> wrote:
> On Sat, Nov 14, 2009 at 11:06 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> The old method (measured on a 4-core high-performance server) has severe
>> scaling issues due to table bloat (we knew that):
>> ./pgbench -c 10 -t 1000 -n -b listen.sql -f notify.sql
>> run #1 tps = 1364.948079 (including connections establishing)
>>
>> New method on my dual-core workstation (max payload 128):
>> ./pgbench -c 10 -t 10000 -n -b listen.sql -f notify.sql -hlocalhost postgres
>> tps = 16343.012373 (including connections establishing)
>
> That looks fine and is similar to my tests, where I also see a
> performance increase of about 10x, and unlike pg_listener it is constant.

The old method scaled (badly) with the volume of notifications, and your
stuff seems to scale based on the number of clients sending simultaneous
notifications. Well, you're better all day long, but it shows that your
fears regarding locking were not completely unfounded. Do the Bucardo
people have any insight into how many simultaneous notifies are generated
from different backends?

merlin
> The old method scaled (badly) with the volume of notifications, and your
> stuff seems to scale based on the number of clients sending simultaneous
> notifications. Well, you're better all day long, but it shows that your
> fears regarding locking were not completely unfounded. Do the Bucardo
> people have any insight into how many simultaneous notifies are generated
> from different backends?

Very low. On a busy system I know of there are about 90 entries in
pg_listener, and I would guess that the maximum rate of simultaneous
notifies is not more than 3 per second, tops. If someone knows an easy way
to measure such a thing and is really curious, I can see about getting
better numbers.

--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation
PGP Key: 0x14964AC8