Think we need major revisions in async.c... - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Think we need major revisions in async.c... |
Date | |
Msg-id | 2520.906858313@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: [HACKERS] Think we need major revisions in async.c...
Re: [HACKERS] Think we need major revisions in async.c... |
List | pgsql-hackers |
Well, I've spent the last several hours studying how NOTIFY works, and I think it's a mess :-(. We need some significant revisions in there. I have not been able to trace an exact flow of control that would yield the NOTIFY-message-interspersed-in-data result that I saw this afternoon. However, I am pretty certain that I know the basic cause of the problem. The risky aspect of NOTIFY is that very substantial amounts of processing (an entire transaction, in fact) are launched from a signal handler routine. As a rule of thumb, doing more than setting a flag variable in a signal handler is a big NO NO, because you don't have any idea what bit of code you've interrupted or what you can safely call. (I have a strong suspicion, in fact, that the behavior I saw arose from the signal interrupt occuring during the backend's fflush() of data destined for the frontend at the end of a query cycle. I dunno how re-entrant the stdio library is on my machine, but it seems quite possible that it behaves undesirably if one attempts to queue more output data from a routine that interrupts fflush()...) The Postgres code tries to protect itself against this sort of problem by having the SIGUSR2 handler not do anything except set a flag if the current transaction state says that a transaction is in progress. This is almost good enough, but evidently not quite. The first issue is that some query initialization and finalization code in PostgresMain happens outside the StartTransactionCommand...CommitTransactionCommand zone of safety. (In particular, the final fflush of data bound for the frontend.) The second issue is that async.c itself sometimes starts and ends transactions, by calling xact.c, and yet xact.c also calls async.c as part of transaction commit. At least two levels of recursion are possible in there. I have not convinced myself whether the bug I encountered is due to some subtly incorrect ordering of events in that interaction, or whether it's just due to non-reentrancy in stdio. But I no longer care. This code is demonstrably buggy, and it is too complex to understand or maintain anyway. I propose to rewrite it. I propose that: 1. The link between inbound-notify processing and transaction start/commit should be removed. Having inbound-notify execution called from transaction commit is too complicated because that notify execution itself may have to induce a transaction start/commit. Instead, inbound notify execution ought to be triggered from the PostgresMain main loop. This has the additional advantage that we can have more control over just which parts of the main loop allow or disallow notify execution directly from the signal handler. The main loop should now contain something like: for (;;) { ReadyForQuery(); // contains fflush() call AllowNotifyInterrupt(); ReadCommand(); DisableNotifyInterrupt(); execute command, including transaction start/commit } so that the SIGUSR2 interrupt is allowed to do something serious only in the *very* tiny window where we are just waiting for or reading the next frontend query. (Since the interrupt has no reason to touch the state of stdin, this should be safe enough.) An interrupt occuring any other time will just set a flag, which will be read by AllowNotifyInterrupt --- if an interrupt has occured since the last DisableNotifyInterrupt, then AllowNotifyInterrupt will carry out the necessary processing. Thus, all inbound-notify execution effectively happens at the outer loop --- either while waiting for frontend input, or just before doing so. 2. The innards of NOTIFY execution are unnecessarily complicated and inefficient, quite aside from the above problem of controlling interrupt safety. The NOTIFY statement itself (routine Async_Notify) makes one pass over the pg_listener relation, writing a 1 into the notification field of each tuple with a matching relname. Then at commit time, we have to make another pass over the pg_listener relation to issue signals (and/or send a message to our own front end, if we're NOTIFYing something we also LISTEN on, which is pretty common in my applications anyway). This is unnecessary processing; worse, if the NOTIFY is done early in a long transaction block, we end up holding a write lock on pg_listen for a long time. Since pg_listen is a global resource, this can effectively bring all the rest of the backends to a halt! The right way to do this is for NOTIFY to simply add the target relation name to a list of pending notifies (as indeed it does already). Then when we are ready to commit, we scan pg_listen *once*, performing tuple updates as needed and also sending signals and/or frontend messages. (Note this is for an *outbound* notify, which we do want to have happen during transaction commit. The real problem with the current async.c is that it tries to combine outbound and inbound notify code. Keeping them separate will be better.) This approach takes less processing, and it also means that we need not do anything so risky as RelationUnsetLockForWrite() to gain performance. The pg_listen lock will be held for only a very short time before we release it as a normal part of commit. Moreover, a self-notify need never modify the pg_listen table at all, saving cycles and opening the door for another feature (see below). If the transaction is aborted rather than committed, we need only discard the list of pending notify names. pg_listen is never touched in that case. 3. While I am at it, I would like to add a feature that I've been wanting for a while. It seems to me that the be_pid field provided in a PGnotify message ought not always be your own backend's PID --- what use is that? What it ought to be is the PID of the *signaling* backend, or at least of one of the signaling backends other than your own. I have a bunch of applications that all read and write the same tables and let each other know about updates by NOTIFY messages. When one of these apps gets a NOTIFY, it has to re-read the table *even if the notify was the one it just issued*, and the updates in question are the ones it just wrote out. This is a waste of cycles, but right now there is no choice because you cannot tell whether the NOTIFY you get is due (solely) to your own NOTIFY or (perhaps in part) to someone else's. What I'd like to do is to have cross-backend notifies write the originating backend's PID into the "notification" field, rather than just a "1". The receiving backend can then send that value to its frontend, as opposed to sending its own PID as it will do in the self-notify case. In this way the frontend can unambiguously distinguish its own NOTIFY bouncing back from a NOTIFY from another client. (If several other clients send NOTIFY at about the same time, you may get only one message with only one of their PIDs, since there is only room for one originating PID in pg_listen. But it will be kept distinct from any message resulting from a NOTIFY of your own.) Note that we already made the backend's own PID available to the frontend as part of the CANCEL-related changes, so that part of the info is there for free. A more radical solution would be to suppress self-notifies altogether, but that is a change in semantics that would likely break some existing apps. I will be satisfied if the frontend can tell the difference between a self-notify and an incoming notify. Any objections? If there's something crucial I've missed in all this, better let me know ASAP! I know that it's a tad late in the beta cycle to be making such significant changes, but my applications really need NOTIFY to work and work reliably. We have to have this *now*. I hope y'all will bear with me. regards, tom lane
pgsql-hackers by date: