Thread: Sending notifications from the master to the standby
People have always expressed interest in $subject, so I wondered how hard it could possibly be and came up with the attached patch.

Notifications that are generated on the master and forwarded to the standby can be used as a convenient way to find out which changes have already made it to the standby. The idea is that you run a transaction on the master, add a "NOTIFY changes_made", and listen on the standby for this event. Once it gets delivered, you know that your transaction has been replayed on the standby. Note that this feature is only about LISTEN on the standby; it still doesn't allow sending NOTIFYs from the standby.

As a reminder, the current implementation of notifications (LISTEN/NOTIFY) in a few words:

- a transaction that executes "NOTIFY channel, payload" adds the notification to backend-local memory
- upon commit, it inserts its notifications, along with its transaction id, into a large SLRU-mapped ring buffer and signals any listening backends
- each listening backend has a pointer into this ring buffer; after each transaction, the backend reads from this pointer position to the end of the ring buffer and delivers all matching notifications to its frontend, provided the transaction that inserted them is known to have committed

In the patch I added a new WAL record type, XLOG_NOTIFY, which emits WAL records as the notifications are written into the pages of the SLRU ring buffer. Whenever an SLRU page is found to be full, a new WAL record is created. That's a more or less arbitrary form of batching, but it's easy to do, and most often there won't be more than a few records per transaction anyway.

The recovery process on the standby side adds the notifications into the standby's SLRU ring buffer, and once the last notification has been added (which might be after a couple more WAL records), it signals the listening backends.

Theoretically we could also run into a full-queue situation on the standby: imagine a long-running transaction that doesn't advance its pointer in the ring buffer, so that no new notifications can be stored in the buffer. The patch introduces a new type of recovery conflict for this reason.

One further optimization (not included for now) would be to keep track of how many backends are actually listening on some channel and, if nobody is listening, discard incoming notifications.
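To make the queue mechanics above concrete, here is a minimal standalone model of the ring-buffer reading rule. This is illustrative C only, not PostgreSQL source; queue_notification(), drain(), and the other names are made up for the sketch.

/*
 * Toy, self-contained model of the notification queue described above.
 * Writers append (xid, channel, payload) entries; each listening
 * backend keeps its own read pointer and stops at the first entry
 * whose transaction is not yet known to have committed.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define QUEUE_SIZE 16
#define MAX_XID    1000

typedef struct
{
    unsigned    xid;
    char        channel[32];
    char        payload[64];
} Notification;

static Notification queue[QUEUE_SIZE];
static int  head;                   /* next write slot */
static bool committed[MAX_XID];     /* toy commit status, indexed by xid */

static void
queue_notification(unsigned xid, const char *chan, const char *payload)
{
    Notification *n = &queue[head++ % QUEUE_SIZE];

    n->xid = xid;
    snprintf(n->channel, sizeof(n->channel), "%s", chan);
    snprintf(n->payload, sizeof(n->payload), "%s", payload);
}

/* Deliver entries from *readpos up to head, halting at an uncommitted xid. */
static void
drain(int *readpos, const char *listen_channel)
{
    while (*readpos < head)
    {
        Notification *n = &queue[*readpos % QUEUE_SIZE];

        if (!committed[n->xid])
            break;              /* writer still in progress: wait */
        if (strcmp(n->channel, listen_channel) == 0)
            printf("deliver: %s -> \"%s\"\n", n->channel, n->payload);
        (*readpos)++;
    }
}

int
main(void)
{
    int backend_pos = 0;

    queue_notification(101, "changes_made", "txn 101 done");
    committed[101] = true;
    queue_notification(102, "changes_made", "txn 102 done");

    drain(&backend_pos, "changes_made");    /* delivers 101, stops at 102 */
    committed[102] = true;
    drain(&backend_pos, "changes_made");    /* now delivers 102 */
    return 0;
}

A backend whose pointer never advances pins the tail of the ring; that is the full-queue situation the new recovery conflict is meant to resolve.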
Joachim Wieland <joe@mcknight.de> writes:
> [ send NOTIFYs to slaves by means of: ]
> In the patch I added a new WAL record type, XLOG_NOTIFY, which emits
> WAL records as the notifications are written into the pages of the
> SLRU ring buffer. Whenever an SLRU page is found to be full, a new
> WAL record is created. That's a more or less arbitrary form of
> batching, but it's easy to do, and most often there won't be more
> than a few records per transaction anyway.

I'm having a hard time wrapping my mind around why you'd do it that way. ISTM there are two fairly serious problems:

1. Emitting WAL records for NOTIFY traffic results in significantly more overhead, with no benefit whatever, for existing non-replicated NOTIFY-using applications. Those folk are going to see a performance degradation, and they're going to complain.

2. Batching NOTIFY traffic will result in a delay in receipt, which will annoy anybody who's trying to make actual use of the notifications on standby servers. The worst case here happens if notify traffic on the master is bursty: the last few messages in a burst might not get to the slave for a long time, certainly long after the commits that the messages were supposed to be telling people about.

So this design is non-optimal both for existing uses and for the proposed new uses, which means nobody will like it. You could ameliorate #1 by adding a GUC that determines whether NOTIFY actually writes WAL, but that's pretty ugly. In any case ISTM that problem #2 means this design is basically broken.

I wonder whether it'd be practical to not involve WAL per se in this at all, but to transmit NOTIFY messages by having walsender processes follow the notify stream (as though they were listeners) and send the notify traffic as a separate message stream interleaved with the WAL traffic. We already have, as of a few days ago, the concept of additional traffic in the walsender stream besides the WAL data itself, so adding notify traffic as another message type should be straightforward. It might be a bit tricky to get walreceivers to inject the data into the slave-side ring buffer at the right time, ie, not until after the commit a given message describes has been replayed; but I don't immediately see a reason to think that's infeasible.

Going in this direction would mean that slave-side LISTEN only works when using walsender/walreceiver, and not with old-style log shipping. But personally I don't see a problem with that. If you're trying to LISTEN you probably want pretty up-to-date data anyway.

			regards, tom lane
On Tue, Jan 10, 2012 at 5:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joachim Wieland <joe@mcknight.de> writes:
>> [ send NOTIFYs to slaves by means of: ]

Good idea.

> I wonder whether it'd be practical to not involve WAL per se in this
> at all, but to transmit NOTIFY messages by having walsender processes
> follow the notify stream (as though they were listeners) and send the
> notify traffic as a separate message stream interleaved with the WAL
> traffic. We already have, as of a few days ago, the concept of
> additional traffic in the walsender stream besides the WAL data itself,
> so adding notify traffic as another message type should be
> straightforward.

Also a good idea.

> It might be a bit tricky to get walreceivers to inject
> the data into the slave-side ring buffer at the right time, ie, not
> until after the commit a given message describes has been replayed;
> but I don't immediately see a reason to think that's infeasible.

When a transaction commits, it would use a full-size commit record and set a (new) flag in xl_xact_commit.xinfo to show that the commit is paired with notify traffic.

Get the messages in walreceiver.c XLogWalRcvProcessMsg() and put them in a shared hash table. Messages would need to contain the xid of the notifying transaction and the other info needed for LISTEN. When we hit xact.c xact_redo_commit() on the standby, we'd check the hash table for messages if the notify flag is set and execute the normal notify code as if the NOTIFY had run locally on the standby. We can sweep the hash table clean of any old messages each time we run ProcArrayApplyRecoveryInfo().

Add a new message type to walprotocol.h. Message code 'L' appears to be available.

Suggest we add something to the initial handshake from the standby to say "please send me notify traffic", which we can link to a parameter that defines the size of standby_notify_buffer. We don't want all standbys to receive such traffic unless they really want it, and pg_basebackup probably doesn't want it either.

If you wanted to get really fancy you could send only some of the traffic to each standby, based on a hash or round-robin algorithm, so we can spread the listeners across multiple standbys.

I'll be your reviewer, if you want.

> Going in this direction would mean that slave-side LISTEN only works
> when using walsender/walreceiver, and not with old-style log shipping.
> But personally I don't see a problem with that. If you're trying to
> LISTEN you probably want pretty up-to-date data anyway.

Which fits the expected use case also.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
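A rough standalone model of the staging scheme sketched above, assuming an xid-keyed table on the standby. This is illustrative C with hypothetical names (stage_notify, redo_commit), not the actual walreceiver.c or xact.c code:

/*
 * Notify messages arriving over the wire are parked keyed by xid;
 * replaying a commit record whose "has notifies" flag is set releases
 * that xid's messages to the local notify machinery.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_STAGED 64

typedef struct
{
    unsigned    xid;
    char        payload[64];
    bool        used;
} StagedNotify;

static StagedNotify staged[MAX_STAGED];

/* walreceiver side: park an incoming 'L' message until its commit replays */
static void
stage_notify(unsigned xid, const char *payload)
{
    for (int i = 0; i < MAX_STAGED; i++)
    {
        if (!staged[i].used)
        {
            staged[i].used = true;
            staged[i].xid = xid;
            snprintf(staged[i].payload, sizeof(staged[i].payload),
                     "%s", payload);
            return;
        }
    }
}

/* redo side: on a commit record flagged as having notifies, release them */
static void
redo_commit(unsigned xid, bool has_notifies)
{
    if (!has_notifies)
        return;
    for (int i = 0; i < MAX_STAGED; i++)
    {
        if (staged[i].used && staged[i].xid == xid)
        {
            printf("xid %u committed, releasing: %s\n",
                   xid, staged[i].payload);
            staged[i].used = false;
        }
    }
}

int
main(void)
{
    stage_notify(200, "cache_invalidate: orders");
    redo_commit(199, false);    /* unrelated commit: nothing released */
    redo_commit(200, true);     /* flagged commit: message goes out */
    return 0;
}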
On Tue, Jan 10, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So this design is non-optimal both for existing uses and for the
> proposed new uses, which means nobody will like it. You could
> ameliorate #1 by adding a GUC that determines whether NOTIFY actually
> writes WAL, but that's pretty ugly. In any case ISTM that problem #2
> means this design is basically broken.

I chose to do it this way because it seemed like the most natural way to do it (which of course doesn't mean it's the best) :-). I agree that there should be a way to avoid the replication of the NOTIFYs.

Regarding your second point though, remember that on the master we write notifications to the queue in pre-commit, and we also don't interleave notifications of different transactions. So once the commit record makes it to the standby, all the notifications are already there, just as on the master. In a burst of notifications, both solutions should more or less behave the same way, but yes, the one involving the WAL file would be slower, as it goes to the file system and back.

> I wonder whether it'd be practical to not involve WAL per se in this
> at all, but to transmit NOTIFY messages by having walsender processes
> follow the notify stream (as though they were listeners) and send the
> notify traffic as a separate message stream interleaved with the WAL
> traffic.

Agreed, having walsender/receiver work as NOTIFY proxies is kinda smart...

Joachim
On Tue, Jan 10, 2012 at 12:56 PM, Joachim Wieland <joe@mcknight.de> wrote:
> I chose to do it this way because it seemed like the most natural way
> to do it (which of course doesn't mean it's the best) :-).

If it's any consolation, it's exactly how I would have done it too, up until about two months ago, and I remember discussing almost exactly the design you presented with someone in Rome last year.

Anyway, it's a good feature, so I hope you have time.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Tue, Jan 10, 2012 at 5:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It might be a bit tricky to get walreceivers to inject
>> the data into the slave-side ring buffer at the right time, ie, not
>> until after the commit a given message describes has been replayed;
>> but I don't immediately see a reason to think that's infeasible.

> [ Simon sketches a design for that ]

Seems a bit overcomplicated. I was just thinking of having walreceiver note the WAL endpoint at the instant of receipt of a notify message, and not release the notify message to the slave ring buffer until WAL replay has advanced that far. You'd need to lay down ground rules about how the walsender times the insertion of notify messages relative to WAL in its output. But I don't see the need for either explicit markers in the WAL stream or a hash table. Indeed, a hash table scares me because it doesn't clearly guarantee that notifies will be released in arrival order.

> Suggest we add something to initial handshake from standby to say
> "please send me notify traffic",

+1 on that.

			regards, tom lane
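A minimal sketch of this gating rule in self-contained, illustrative C; receive_notify(), replay_advanced_to(), and the LSN values are all made up, not walreceiver code:

/*
 * Each notify message is stamped with the WAL endpoint at the moment
 * it arrives, and is released to the local queue only once replay has
 * advanced at least that far. Because arrivals are stamped in order,
 * releases stay in order too.
 */
#include <stdio.h>
#include <string.h>

#define MAX_PENDING 32

typedef struct
{
    unsigned long required_lsn; /* WAL endpoint when the message arrived */
    char          msg[64];
} PendingNotify;

static PendingNotify pending[MAX_PENDING];
static int  npending;
static int  released;           /* all entries before this index are out */

static void
receive_notify(unsigned long wal_endpoint, const char *msg)
{
    pending[npending].required_lsn = wal_endpoint;
    snprintf(pending[npending].msg, sizeof(pending[npending].msg),
             "%s", msg);
    npending++;
}

static void
replay_advanced_to(unsigned long replay_lsn)
{
    while (released < npending &&
           pending[released].required_lsn <= replay_lsn)
    {
        printf("release at %lu: %s\n", replay_lsn, pending[released].msg);
        released++;
    }
}

int
main(void)
{
    receive_notify(1000, "first");  /* commit not replayed yet */
    receive_notify(1200, "second");
    replay_advanced_to(900);        /* nothing released */
    replay_advanced_to(1100);       /* releases "first" */
    replay_advanced_to(1300);       /* releases "second" */
    return 0;
}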
On Tue, Jan 10, 2012 at 4:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On Tue, Jan 10, 2012 at 5:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> It might be a bit tricky to get walreceivers to inject
>>> the data into the slave-side ring buffer at the right time, ie, not
>>> until after the commit a given message describes has been replayed;
>>> but I don't immediately see a reason to think that's infeasible.
>
>> [ Simon sketches a design for that ]
>
> Seems a bit overcomplicated. I was just thinking of having walreceiver
> note the WAL endpoint at the instant of receipt of a notify message,
> and not release the notify message to the slave ring buffer until WAL
> replay has advanced that far. You'd need to lay down ground rules about
> how the walsender times the insertion of notify messages relative to
> WAL in its output.

You have to store the messages somewhere until they're needed. If that somewhere isn't on the standby, very close to the Startup process, then it's going to be very slow. Putting a marker in the WAL stream guarantees arrival order. The hash table was just a place to store the messages until they're needed; it could be a ring buffer as well. Inserts into the slave ring buffer already carry an xid, so the existing test will probably already cope with messages inserted before their parent xid has committed. The only problem is coping with possible out-of-sequence messages.

> But I don't see the need for either explicit markers
> in the WAL stream or a hash table. Indeed, a hash table scares me
> because it doesn't clearly guarantee that notifies will be released in
> arrival order.

The hash table is clearly not the thing providing an arrival-order guarantee; it was just a cache. You have a few choices:

(1) Send the messages while holding an exclusive lock.

(2) Send them as they come and buffer them, then reorder them using the WAL log sequence, since that matches the original commit sequence.

(3) Add a sequence number to the messages sent by the walsender, so that the walreceiver can buffer them locally and insert them in the correct order into the normal ring buffer; so in (3) the message sequence and the WAL sequence match, but the mechanism is different.

(1) is out, because the purpose of offloading to the standby is to give the master more capacity. If we slow it down in order to serve the standby, we're doing things the wrong way around. I was choosing (2); maybe you prefer (3) or another design entirely. They look very similar to me and about the same complexity; it's just copying data and preserving sequence.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
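For concreteness, here is a tiny self-contained reorder buffer along the lines of option (3). Illustrative C only, with made-up names, not a walreceiver patch; it assumes arrivals never run more than WINDOW entries ahead of the release point:

/*
 * The sending side stamps each message with a sequence number; the
 * receiver buffers arrivals and hands them to the ring buffer strictly
 * in sequence, so the insert order matches the master's commit order
 * even if the wire delivers them out of order.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define WINDOW 16

static char     buf[WINDOW][64];
static bool     have[WINDOW];
static unsigned next_seq;       /* next sequence number to release */

static void
receive(unsigned seq, const char *msg)
{
    snprintf(buf[seq % WINDOW], sizeof(buf[0]), "%s", msg);
    have[seq % WINDOW] = true;

    /* release the contiguous run starting at next_seq */
    while (have[next_seq % WINDOW])
    {
        printf("insert into ring buffer: seq %u: %s\n",
               next_seq, buf[next_seq % WINDOW]);
        have[next_seq % WINDOW] = false;
        next_seq++;
    }
}

int
main(void)
{
    receive(1, "second message");   /* out of order: held back */
    receive(0, "first message");    /* releases seq 0, then seq 1 */
    receive(2, "third message");
    return 0;
}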
On Tue, Jan 10, 2012 at 11:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> [ Tom sketches a design ]
> Seems a bit overcomplicated. I was just thinking of having walreceiver
> note the WAL endpoint at the instant of receipt of a notify message,
> and not release the notify message to the slave ring buffer until WAL
> replay has advanced that far.

How about this: we mark a notify message specially if it is the last message sent by a transaction, and we also add a flag to commit/abort records indicating whether or not the transaction has sent notifies. Now if such a last message is being put into the regular ring buffer on the standby and the xid is known to have committed or aborted, signal the backends. Also signal from a commit/abort record if the flag is set.

If the notify messages make it to the standby first, we just put messages of a not-yet-committed transaction into the queue, just as on the master. Listeners will get signaled when the commit record arrives. If the commit record arrives first, we signal, but the listeners won't find anything (at least not the latest notifications). When the last notify of that transaction finally arrives, the transaction is known to have committed and the listeners will get signaled.

What could still happen is that the standby receives notifies, then the commit message, and then more notifies. Listeners would still eventually get all the messages, but potentially not all of them at once. Is this a problem? If so, then we could add a special "stop reading" record to the queue before we write the notifies, and subsequently change it into a "continue reading" record once all notifications are in the queue. Readers would treat a "stop reading" record just like a not-yet-committed transaction and ignore a "continue reading" record.

>> Suggest we add something to initial handshake from standby to say
>> "please send me notify traffic",
>
> +1 on that.

From what you said, I imagined this walsender listener as a regular listener that listens on the union of all the sets of channels that anybody is listening on on the standby, with the LISTEN transaction on the standby returning from commit once the listener is known to have been set up on the master.

Joachim
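A toy illustration of the "stop reading"/"continue reading" marker idea, in self-contained C (all names hypothetical, not actual queue code): readers halt at a STOP entry just as they would at an uncommitted transaction, and the writer flips it to CONTINUE once the batch is complete.

#include <stdio.h>

typedef enum { ENTRY_NOTIFY, ENTRY_STOP, ENTRY_CONTINUE } EntryKind;

typedef struct
{
    EntryKind   kind;
    const char *payload;
} Entry;

static Entry queue[8];
static int   head;

static void
drain(int *readpos)
{
    while (*readpos < head)
    {
        Entry *e = &queue[*readpos];

        if (e->kind == ENTRY_STOP)
            break;                      /* batch not complete yet */
        if (e->kind == ENTRY_NOTIFY)
            printf("deliver: %s\n", e->payload);
        (*readpos)++;                   /* CONTINUE is just skipped */
    }
}

int
main(void)
{
    int pos = 0;

    queue[head++] = (Entry){ENTRY_STOP, NULL};
    queue[head++] = (Entry){ENTRY_NOTIFY, "batched message"};
    drain(&pos);                        /* halts at STOP, delivers nothing */

    queue[0].kind = ENTRY_CONTINUE;     /* batch complete: flip the marker */
    drain(&pos);                        /* now delivers the message */
    return 0;
}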
Joachim Wieland <joe@mcknight.de> writes:
> On Tue, Jan 10, 2012 at 11:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> Suggest we add something to initial handshake from standby to say
>>> "please send me notify traffic",

>> +1 on that.

> From what you said, I imagined this walsender listener as a regular
> listener that listens on the union of all the sets of channels that
> anybody is listening on on the standby, with the LISTEN transaction on
> the standby returning from commit once the listener is known to have
> been set up on the master.

This seems vastly overcomplicated too. I'd just vote for a simple yes/no flag, so that receivers that have no interest in notifies don't have to deal with them.

			regards, tom lane
BTW ... it occurs to me to ask whether we really have a solid use-case for having listeners attached to slave servers. I have personally never seen an application for LISTEN/NOTIFY in which the listeners were entirely read-only. Even if there are one or two cases out there, it's not clear to me that supporting it is worth the extra complexity that seems to be needed.

			regards, tom lane
On Wed, Jan 11, 2012 at 4:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW ... it occurs to me to ask whether we really have a solid use-case
> for having listeners attached to slave servers. I have personally never
> seen an application for LISTEN/NOTIFY in which the listeners were
> entirely read-only. Even if there are one or two cases out there, it's
> not clear to me that supporting it is worth the extra complexity that
> seems to be needed.

The idea is to support external caches that re-read the data when it changes. If we can do that from the standby, then we offload from the master. Yes, there are other applications for LISTEN/NOTIFY, and we wouldn't be able to support them all with this.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Tom,

> BTW ... it occurs to me to ask whether we really have a solid use-case
> for having listeners attached to slave servers. I have personally never
> seen an application for LISTEN/NOTIFY in which the listeners were
> entirely read-only. Even if there are one or two cases out there, it's
> not clear to me that supporting it is worth the extra complexity that
> seems to be needed.

Actually, I've seen requests for it from my clients and on IRC. Not sure how serious those are, but users have brought it up. Certainly users intuitively think they should be able to LISTEN on a standby, and are surprised when they find out they can't.

The basic idea is that if we can replicate LISTENs, then you can use replication as a simple distributed (and lossy) queueing system. This is especially useful if the replica is geographically distant and there are a lot of listeners.

The obvious first use case for this is cache invalidation. For example, we have one application where we're using Redis to queue cache invalidation messages; if LISTEN/NOTIFY were replicated, we could use it instead and simplify our infrastructure.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
>> BTW ... it occurs to me to ask whether we really have a solid use-case
>> for having listeners attached to slave servers. I have personally never
>> seen an application for LISTEN/NOTIFY in which the listeners were
>> entirely read-only. Even if there are one or two cases out there, it's
>> not clear to me that supporting it is worth the extra complexity that
>> seems to be needed.

> The basic idea is that if we can replicate LISTENs, then you can use
> replication as a simple distributed (and lossy) queueing system.

Well, this is exactly what I don't believe. A queueing system requires that recipients be able to remove things from the queue. You can't do that on a slave server, because you can't make any change in the database that would be visible to other users.

> The obvious first use case for this is cache invalidation.

Yeah, upthread Simon pointed out that propagating notifies would be useful for flushing caches in applications that watch the database in a read-only fashion. I grant that such a use-case is technically possible within the limitations of a slave server; I'm just dubious that it's a sufficiently attractive use-case to justify the complexity and future maintenance costs of the sort of designs we are talking about. Or in other words: so far, cache invalidation is not the "first" use-case, it's the ONLY POSSIBLE use-case. That's not useful enough.

			regards, tom lane
> Yeah, upthread Simon pointed out that propagating notifies would be
> useful for flushing caches in applications that watch the database in a
> read-only fashion. I grant that such a use-case is technically possible
> within the limitations of a slave server; I'm just dubious that it's a
> sufficiently attractive use-case to justify the complexity and future
> maintenance costs of the sort of designs we are talking about. Or in
> other words: so far, cache invalidation is not the "first" use-case,
> it's the ONLY POSSIBLE use-case. That's not useful enough.

Well, cache invalidation is a pretty common task; probably more than 50% of all database applications need to do it. Note that we're not just talking about memcached for web applications here. For example, one of the companies quoted for the PostgreSQL 9.0 release uses LISTEN/NOTIFY to inform remote devices (POS systems) that there's new data available for them. That's a form of cache invalidation. It's certainly a more common design pattern than using XML in the database.

However, there's the question of whether or not this patch actually allows a master-slave replication system to support more listeners more efficiently than having them all simply listen to the master, and what impact it has on the performance of LISTEN/NOTIFY on standalone systems.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 11 January 2012 23:51, Josh Berkus <josh@agliodbs.com> wrote:
>> Yeah, upthread Simon pointed out that propagating notifies would be
>> useful for flushing caches in applications that watch the database in a
>> read-only fashion. I grant that such a use-case is technically possible
>> within the limitations of a slave server; I'm just dubious that it's a
>> sufficiently attractive use-case to justify the complexity and future
>> maintenance costs of the sort of designs we are talking about. Or in
>> other words: so far, cache invalidation is not the "first" use-case,
>> it's the ONLY POSSIBLE use-case. That's not useful enough.
>
> Well, cache invalidation is a pretty common task; probably more than 50%
> of all database applications need to do it.

I agree that it would be nice to support this type of cache invalidation. Without commenting on the implementation, I think that the concept is very useful, and of immediate benefit to a significant number of people.

-- 
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Jan 11, 2012 at 11:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The obvious first use case for this is cache invalidation.
>
> Yeah, upthread Simon pointed out that propagating notifies would be
> useful for flushing caches in applications that watch the database in a
> read-only fashion. I grant that such a use-case is technically possible
> within the limitations of a slave server; I'm just dubious that it's a
> sufficiently attractive use-case to justify the complexity and future
> maintenance costs of the sort of designs we are talking about. Or in
> other words: so far, cache invalidation is not the "first" use-case,
> it's the ONLY POSSIBLE use-case. That's not useful enough.

Many people clearly do think this is useful. I personally don't think it will be that complex. I'm willing to review and maintain it if the patch works the way we want it to.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> Many people clearly do think this is useful.

It also comes under the heading of "avoiding surprising behavior". That is, users instinctively expect to be able to LISTEN on standbys, and are surprised when they can't.

> I personally don't think it will be that complex. I'm willing to
> review and maintain it if the patch works the way we want it to.

I think we need some performance testing for the review to be valid:

1) How does this patch affect the speed and throughput of LISTEN/NOTIFY on a standalone server?

2) Can we actually attach more LISTENers to multiple standbys than we could to a single master?

Unfortunately, I don't have an application which can LISTEN in a way that doesn't eclipse any differences in throughput or response time we would see on the DB side. Does anyone?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com