Thread: Think we need major revisions in async.c...
Well, I've spent the last several hours studying how NOTIFY works, and I think it's a mess :-(. We need some significant revisions in there. I have not been able to trace an exact flow of control that would yield the NOTIFY-message-interspersed-in-data result that I saw this afternoon. However, I am pretty certain that I know the basic cause of the problem. The risky aspect of NOTIFY is that very substantial amounts of processing (an entire transaction, in fact) are launched from a signal handler routine. As a rule of thumb, doing more than setting a flag variable in a signal handler is a big NO NO, because you don't have any idea what bit of code you've interrupted or what you can safely call. (I have a strong suspicion, in fact, that the behavior I saw arose from the signal interrupt occuring during the backend's fflush() of data destined for the frontend at the end of a query cycle. I dunno how re-entrant the stdio library is on my machine, but it seems quite possible that it behaves undesirably if one attempts to queue more output data from a routine that interrupts fflush()...) The Postgres code tries to protect itself against this sort of problem by having the SIGUSR2 handler not do anything except set a flag if the current transaction state says that a transaction is in progress. This is almost good enough, but evidently not quite. The first issue is that some query initialization and finalization code in PostgresMain happens outside the StartTransactionCommand...CommitTransactionCommand zone of safety. (In particular, the final fflush of data bound for the frontend.) The second issue is that async.c itself sometimes starts and ends transactions, by calling xact.c, and yet xact.c also calls async.c as part of transaction commit. At least two levels of recursion are possible in there. I have not convinced myself whether the bug I encountered is due to some subtly incorrect ordering of events in that interaction, or whether it's just due to non-reentrancy in stdio. But I no longer care. This code is demonstrably buggy, and it is too complex to understand or maintain anyway. I propose to rewrite it. I propose that: 1. The link between inbound-notify processing and transaction start/commit should be removed. Having inbound-notify execution called from transaction commit is too complicated because that notify execution itself may have to induce a transaction start/commit. Instead, inbound notify execution ought to be triggered from the PostgresMain main loop. This has the additional advantage that we can have more control over just which parts of the main loop allow or disallow notify execution directly from the signal handler. The main loop should now contain something like: for (;;) { ReadyForQuery(); // contains fflush() call AllowNotifyInterrupt(); ReadCommand(); DisableNotifyInterrupt(); execute command, including transaction start/commit } so that the SIGUSR2 interrupt is allowed to do something serious only in the *very* tiny window where we are just waiting for or reading the next frontend query. (Since the interrupt has no reason to touch the state of stdin, this should be safe enough.) An interrupt occuring any other time will just set a flag, which will be read by AllowNotifyInterrupt --- if an interrupt has occured since the last DisableNotifyInterrupt, then AllowNotifyInterrupt will carry out the necessary processing. Thus, all inbound-notify execution effectively happens at the outer loop --- either while waiting for frontend input, or just before doing so. 2. The innards of NOTIFY execution are unnecessarily complicated and inefficient, quite aside from the above problem of controlling interrupt safety. The NOTIFY statement itself (routine Async_Notify) makes one pass over the pg_listener relation, writing a 1 into the notification field of each tuple with a matching relname. Then at commit time, we have to make another pass over the pg_listener relation to issue signals (and/or send a message to our own front end, if we're NOTIFYing something we also LISTEN on, which is pretty common in my applications anyway). This is unnecessary processing; worse, if the NOTIFY is done early in a long transaction block, we end up holding a write lock on pg_listen for a long time. Since pg_listen is a global resource, this can effectively bring all the rest of the backends to a halt! The right way to do this is for NOTIFY to simply add the target relation name to a list of pending notifies (as indeed it does already). Then when we are ready to commit, we scan pg_listen *once*, performing tuple updates as needed and also sending signals and/or frontend messages. (Note this is for an *outbound* notify, which we do want to have happen during transaction commit. The real problem with the current async.c is that it tries to combine outbound and inbound notify code. Keeping them separate will be better.) This approach takes less processing, and it also means that we need not do anything so risky as RelationUnsetLockForWrite() to gain performance. The pg_listen lock will be held for only a very short time before we release it as a normal part of commit. Moreover, a self-notify need never modify the pg_listen table at all, saving cycles and opening the door for another feature (see below). If the transaction is aborted rather than committed, we need only discard the list of pending notify names. pg_listen is never touched in that case. 3. While I am at it, I would like to add a feature that I've been wanting for a while. It seems to me that the be_pid field provided in a PGnotify message ought not always be your own backend's PID --- what use is that? What it ought to be is the PID of the *signaling* backend, or at least of one of the signaling backends other than your own. I have a bunch of applications that all read and write the same tables and let each other know about updates by NOTIFY messages. When one of these apps gets a NOTIFY, it has to re-read the table *even if the notify was the one it just issued*, and the updates in question are the ones it just wrote out. This is a waste of cycles, but right now there is no choice because you cannot tell whether the NOTIFY you get is due (solely) to your own NOTIFY or (perhaps in part) to someone else's. What I'd like to do is to have cross-backend notifies write the originating backend's PID into the "notification" field, rather than just a "1". The receiving backend can then send that value to its frontend, as opposed to sending its own PID as it will do in the self-notify case. In this way the frontend can unambiguously distinguish its own NOTIFY bouncing back from a NOTIFY from another client. (If several other clients send NOTIFY at about the same time, you may get only one message with only one of their PIDs, since there is only room for one originating PID in pg_listen. But it will be kept distinct from any message resulting from a NOTIFY of your own.) Note that we already made the backend's own PID available to the frontend as part of the CANCEL-related changes, so that part of the info is there for free. A more radical solution would be to suppress self-notifies altogether, but that is a change in semantics that would likely break some existing apps. I will be satisfied if the frontend can tell the difference between a self-notify and an incoming notify. Any objections? If there's something crucial I've missed in all this, better let me know ASAP! I know that it's a tad late in the beta cycle to be making such significant changes, but my applications really need NOTIFY to work and work reliably. We have to have this *now*. I hope y'all will bear with me. regards, tom lane
> What I'd like to do is to have cross-backend notifies write the > originating backend's PID into the "notification" field, rather than > just a "1". The receiving backend can then send that value to its > frontend, as opposed to sending its own PID as it will do in the > self-notify case. In this way the frontend can unambiguously > distinguish its own NOTIFY bouncing back from a NOTIFY from another > client. (If several other clients send NOTIFY at about the same time, > you may get only one message with only one of their PIDs, since there > is only room for one originating PID in pg_listen. But it will be kept > distinct from any message resulting from a NOTIFY of your own.) Note > that we already made the backend's own PID available to the frontend as > part of the CANCEL-related changes, so that part of the info is there > for free. > > A more radical solution would be to suppress self-notifies altogether, > but that is a change in semantics that would likely break some existing > apps. I will be satisfied if the frontend can tell the difference > between a self-notify and an incoming notify. > > > Any objections? If there's something crucial I've missed in all this, > better let me know ASAP! > > I know that it's a tad late in the beta cycle to be making such significant > changes, but my applications really need NOTIFY to work and work reliably. > We have to have this *now*. I hope y'all will bear with me. Sounds good. You have worked with the Notify code more than anyone else, so go ahead. You are fixing a bug, and that is fine at this stage. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
> > Well, I've spent the last several hours studying how NOTIFY works, and > I think it's a mess :-(. We need some significant revisions in there. I agreee. I tried to improve it but it's really a mess. > I have not been able to trace an exact flow of control that would > yield the NOTIFY-message-interspersed-in-data result that I saw this > afternoon. However, I am pretty certain that I know the basic cause > of the problem. The risky aspect of NOTIFY is that very substantial > amounts of processing (an entire transaction, in fact) are launched from > a signal handler routine. As a rule of thumb, doing more than setting a > flag variable in a signal handler is a big NO NO, because you don't have > any idea what bit of code you've interrupted or what you can safely > call. (I have a strong suspicion, in fact, that the behavior I saw > arose from the signal interrupt occuring during the backend's fflush() I suspected this could happen but don't really know how signals are handled. > of data destined for the frontend at the end of a query cycle. I dunno > how re-entrant the stdio library is on my machine, but it seems quite > possible that it behaves undesirably if one attempts to queue more > output data from a routine that interrupts fflush()...) > > The Postgres code tries to protect itself against this sort of problem > by having the SIGUSR2 handler not do anything except set a flag if the > current transaction state says that a transaction is in progress. This > is almost good enough, but evidently not quite. The first issue is that > some query initialization and finalization code in PostgresMain happens > outside the StartTransactionCommand...CommitTransactionCommand zone of > safety. (In particular, the final fflush of data bound for the > frontend.) The second issue is that async.c itself sometimes starts and > ends transactions, by calling xact.c, and yet xact.c also calls async.c > as part of transaction commit. At least two levels of recursion are > possible in there. I have not convinced myself whether the bug I I believe that the transaction code can avoid this type of recursion. > encountered is due to some subtly incorrect ordering of events in that > interaction, or whether it's just due to non-reentrancy in stdio. > But I no longer care. This code is demonstrably buggy, and it is too > complex to understand or maintain anyway. I propose to rewrite it. > > I propose that: > > 1. The link between inbound-notify processing and transaction > start/commit should be removed. Having inbound-notify execution called > from transaction commit is too complicated because that notify execution > itself may have to induce a transaction start/commit. Instead, inbound > notify execution ought to be triggered from the PostgresMain main loop. The reason why it is done at commit is that it is the best place where we can know if the transaction completed or not. If we move the notify code in the main loop we must also keep some flag which tell us if and when the notify can be done. > This has the additional advantage that we can have more control over > just which parts of the main loop allow or disallow notify execution > directly from the signal handler. The main loop should now contain > something like: > > for (;;) { > > ReadyForQuery(); // contains fflush() call > > AllowNotifyInterrupt(); > > ReadCommand(); > > DisableNotifyInterrupt(); > > execute command, including transaction start/commit > } > > so that the SIGUSR2 interrupt is allowed to do something serious > only in the *very* tiny window where we are just waiting for or > reading the next frontend query. (Since the interrupt has no > reason to touch the state of stdin, this should be safe enough.) > An interrupt occuring any other time will just set a flag, which > will be read by AllowNotifyInterrupt --- if an interrupt has occured > since the last DisableNotifyInterrupt, then AllowNotifyInterrupt will > carry out the necessary processing. Thus, all inbound-notify execution > effectively happens at the outer loop --- either while waiting for > frontend input, or just before doing so. > > > 2. The innards of NOTIFY execution are unnecessarily complicated and > inefficient, quite aside from the above problem of controlling interrupt > safety. The NOTIFY statement itself (routine Async_Notify) makes one > pass over the pg_listener relation, writing a 1 into the notification > field of each tuple with a matching relname. Then at commit time, we > have to make another pass over the pg_listener relation to issue signals > (and/or send a message to our own front end, if we're NOTIFYing > something we also LISTEN on, which is pretty common in my applications > anyway). This is unnecessary processing; worse, if the NOTIFY is done > early in a long transaction block, we end up holding a write lock on > pg_listen for a long time. Since pg_listen is a global resource, this > can effectively bring all the rest of the backends to a halt! This is the standard behaviour of the notify code. If you set the flag notifyunlock in .../data/pg_options the write lock is released immediately after updating the table, but I suspect this is the reason of the duplicate oids. For this reason I left the flag disabled by default, and maybe this explains why you don't get duplicate oids in 6.4, but I have enabled it in a production environment without problems at all. > The right way to do this is for NOTIFY to simply add the target relation > name to a list of pending notifies (as indeed it does already). Then > when we are ready to commit, we scan pg_listen *once*, performing tuple > updates as needed and also sending signals and/or frontend messages. > (Note this is for an *outbound* notify, which we do want to have happen > during transaction commit. The real problem with the current async.c is > that it tries to combine outbound and inbound notify code. Keeping them The real problem is that inbound notifiy can happen at any moment from the signal handler. > separate will be better.) This approach takes less processing, and it > also means that we need not do anything so risky as > RelationUnsetLockForWrite() to gain performance. The pg_listen lock > will be held for only a very short time before we release it as a normal > part of commit. Moreover, a self-notify need never modify the pg_listen > table at all, saving cycles and opening the door for another feature > (see below). But in general you must still get the write lock to prevent deadlocks. > If the transaction is aborted rather than committed, we need only > discard the list of pending notify names. pg_listen is never touched > in that case. This sounds reasonable. > 3. While I am at it, I would like to add a feature that I've been > wanting for a while. It seems to me that the be_pid field provided in a > PGnotify message ought not always be your own backend's PID --- what use > is that? What it ought to be is the PID of the *signaling* backend, or > at least of one of the signaling backends other than your own. I have > a bunch of applications that all read and write the same tables and let > each other know about updates by NOTIFY messages. When one of these > apps gets a NOTIFY, it has to re-read the table *even if the notify was > the one it just issued*, and the updates in question are the ones it > just wrote out. This is a waste of cycles, but right now there is no > choice because you cannot tell whether the NOTIFY you get is due > (solely) to your own NOTIFY or (perhaps in part) to someone else's. The problem is that notify has the same semantic of unix signals. You can have more notifies for the same table from many backends but you only know that at least one has happened. What you propose should be some sort of message queuing between backends in which each message is notified separately to the other backend In this case it would also be useful to pass some arbitrary text to the listening backend. > What I'd like to do is to have cross-backend notifies write the > originating backend's PID into the "notification" field, rather than > just a "1". The receiving backend can then send that value to its > frontend, as opposed to sending its own PID as it will do in the > self-notify case. In this way the frontend can unambiguously > distinguish its own NOTIFY bouncing back from a NOTIFY from another > client. (If several other clients send NOTIFY at about the same time, > you may get only one message with only one of their PIDs, since there > is only room for one originating PID in pg_listen. But it will be kept > distinct from any message resulting from a NOTIFY of your own.) Note > that we already made the backend's own PID available to the frontend as > part of the CANCEL-related changes, so that part of the info is there > for free. > > A more radical solution would be to suppress self-notifies altogether, > but that is a change in semantics that would likely break some existing > apps. I will be satisfied if the frontend can tell the difference > between a self-notify and an incoming notify. I wouldn't change this. > Any objections? If there's something crucial I've missed in all this, > better let me know ASAP! > > I know that it's a tad late in the beta cycle to be making such significant > changes, but my applications really need NOTIFY to work and work reliably. > We have to have this *now*. I hope y'all will bear with me. > > regards, tom lane > Massimo Dal Zotto +----------------------------------------------------------------------+ | Massimo Dal Zotto email: dz@cs.unitn.it | | Via Marconi, 141 phone: ++39-461-534251 | | 38057 Pergine Valsugana (TN) www: http://www.cs.unitn.it/~dz/ | | Italy pgp: finger dz@tango.cs.unitn.it | +----------------------------------------------------------------------+
Massimo Dal Zotto <dz@cs.unitn.it> writes: >> 1. The link between inbound-notify processing and transaction >> start/commit should be removed. Having inbound-notify execution called >> from transaction commit is too complicated because that notify execution >> itself may have to induce a transaction start/commit. Instead, inbound >> notify execution ought to be triggered from the PostgresMain main loop. > The reason why it is done at commit is that it is the best place where we > can know if the transaction completed or not. If we move the notify code > in the main loop we must also keep some flag which tell us if and when the > notify can be done. No, you're confusing outbound and inbound notify. Outbound notify --- ie, actually doing something about a NOTIFY command issued by the user --- should indeed be done from transaction commit. If the transaction is aborted, we want the NOTIFY to never take effect, just like all the other side-effects of the transaction. What I proposed for that was that we could simplify matters by doing both the signal-sending and pg_listener-table-updating work at the same time during transaction commit, rather than changing the table and issuing the signals in separate passes. Inbound notify, ie, receiving messages from other backends, is a different story. There is no question of not accepting the signals; they've already been committed by some other backend. Now for implementation reasons we need to hold off responding to an inbound notify until we are outside of any transaction. (Otherwise, our attempt to mark the notify accepted by changing pg_listener could be undone by a later transaction abort; not what we want.) But that doesn't mean the response processing has to be called from xact.c. Better to do it at the main loop, with a check to see if we are inside a transaction block or not. Otherwise we have to depend on stdio to be re-entrant. Also, xact.c is already tricky enough without having to worry about the exact order of operations in straightline code. I spent some time on Saturday trying to see if the variables the signal handler was looking at were being changed before the transaction was quite complete --- and I'm still not sure whether it was really safe, given the potential for mutual recursion between async.c and xact.c in the old code. Writing code that interacts correctly with an ISR is tricky stuff, so you really want to do it in a place that doesn't have much of anything else to worry about. > This is the standard behaviour of the notify code. If you set the flag > notifyunlock in .../data/pg_options the write lock is released immediately > after updating the table, but I suspect this is the reason of the > duplicate oids. For this reason I left the flag disabled by default, and > maybe this explains why you don't get duplicate oids in 6.4, but I have > enabled it in a production environment without problems at all. Neither NotifyUnlock nor NotifyHack are necessary in the rewritten code... >> This is a waste of cycles, but right now there is no >> choice because you cannot tell whether the NOTIFY you get is due >> (solely) to your own NOTIFY or (perhaps in part) to someone else's. > What you propose should be some sort of message queuing between backends > in which each message is notified separately to the other backend In this > case it would also be useful to pass some arbitrary text to the listening > backend. Not really. For the applications I'm interested in, there are only two interesting cases: a NOTIFY from some other client vs. one of your own NOTIFYs bouncing back at you. With the code as I now have it, you can differentiate those cases. You're right that a more general message-passing facility might be useful. On the other hand, one can be built by users using an ordinary table and NOTIFY to wake up sleeping respondents. So I'm not sure we need to put more stuff in the backend for that. I have been testing this new code and it seems to work fine, except for the question of interlocks against "select * from pg_listener" --- and if that's a bug, it's one that was there before anyway. So I'm ready to check it in. (Marc, are you there?) regards, tom lane