Thread: Think we need major revisions in async.c...

Think we need major revisions in async.c...

From
Tom Lane
Date:
Well, I've spent the last several hours studying how NOTIFY works, and
I think it's a mess :-(.  We need some significant revisions in there.

I have not been able to trace an exact flow of control that would
yield the NOTIFY-message-interspersed-in-data result that I saw this
afternoon.  However, I am pretty certain that I know the basic cause
of the problem.  The risky aspect of NOTIFY is that very substantial
amounts of processing (an entire transaction, in fact) are launched from
a signal handler routine.  As a rule of thumb, doing more than setting a
flag variable in a signal handler is a big NO NO, because you don't have
any idea what bit of code you've interrupted or what you can safely
call.  (I have a strong suspicion, in fact, that the behavior I saw
arose from the signal interrupt occuring during the backend's fflush()
of data destined for the frontend at the end of a query cycle.  I dunno
how re-entrant the stdio library is on my machine, but it seems quite
possible that it behaves undesirably if one attempts to queue more
output data from a routine that interrupts fflush()...)

The Postgres code tries to protect itself against this sort of problem
by having the SIGUSR2 handler not do anything except set a flag if the
current transaction state says that a transaction is in progress.  This
is almost good enough, but evidently not quite.  The first issue is that
some query initialization and finalization code in PostgresMain happens
outside the StartTransactionCommand...CommitTransactionCommand zone of
safety.  (In particular, the final fflush of data bound for the
frontend.)  The second issue is that async.c itself sometimes starts and
ends transactions, by calling xact.c, and yet xact.c also calls async.c
as part of transaction commit.  At least two levels of recursion are
possible in there.  I have not convinced myself whether the bug I
encountered is due to some subtly incorrect ordering of events in that
interaction, or whether it's just due to non-reentrancy in stdio.
But I no longer care.  This code is demonstrably buggy, and it is too
complex to understand or maintain anyway.  I propose to rewrite it.

I propose that:

1. The link between inbound-notify processing and transaction
start/commit should be removed.  Having inbound-notify execution called
from transaction commit is too complicated because that notify execution
itself may have to induce a transaction start/commit.  Instead, inbound
notify execution ought to be triggered from the PostgresMain main loop.
This has the additional advantage that we can have more control over
just which parts of the main loop allow or disallow notify execution
directly from the signal handler.  The main loop should now contain
something like:

    for (;;) {

        ReadyForQuery();    // contains fflush() call

        AllowNotifyInterrupt();

        ReadCommand();

        DisableNotifyInterrupt();

        execute command, including transaction start/commit
    }

so that the SIGUSR2 interrupt is allowed to do something serious
only in the *very* tiny window where we are just waiting for or
reading the next frontend query.  (Since the interrupt has no
reason to touch the state of stdin, this should be safe enough.)
An interrupt occuring any other time will just set a flag, which
will be read by AllowNotifyInterrupt --- if an interrupt has occured
since the last DisableNotifyInterrupt, then AllowNotifyInterrupt will
carry out the necessary processing.  Thus, all inbound-notify execution
effectively happens at the outer loop --- either while waiting for
frontend input, or just before doing so.


2. The innards of NOTIFY execution are unnecessarily complicated and
inefficient, quite aside from the above problem of controlling interrupt
safety.  The NOTIFY statement itself (routine Async_Notify) makes one
pass over the pg_listener relation, writing a 1 into the notification
field of each tuple with a matching relname.  Then at commit time, we
have to make another pass over the pg_listener relation to issue signals
(and/or send a message to our own front end, if we're NOTIFYing
something we also LISTEN on, which is pretty common in my applications
anyway).  This is unnecessary processing; worse, if the NOTIFY is done
early in a long transaction block, we end up holding a write lock on
pg_listen for a long time.  Since pg_listen is a global resource, this
can effectively bring all the rest of the backends to a halt!

The right way to do this is for NOTIFY to simply add the target relation
name to a list of pending notifies (as indeed it does already).  Then
when we are ready to commit, we scan pg_listen *once*, performing tuple
updates as needed and also sending signals and/or frontend messages.
(Note this is for an *outbound* notify, which we do want to have happen
during transaction commit.  The real problem with the current async.c is
that it tries to combine outbound and inbound notify code.  Keeping them
separate will be better.)  This approach takes less processing, and it
also means that we need not do anything so risky as
RelationUnsetLockForWrite() to gain performance.  The pg_listen lock
will be held for only a very short time before we release it as a normal
part of commit.  Moreover, a self-notify need never modify the pg_listen
table at all, saving cycles and opening the door for another feature
(see below).

If the transaction is aborted rather than committed, we need only
discard the list of pending notify names.  pg_listen is never touched
in that case.


3. While I am at it, I would like to add a feature that I've been
wanting for a while.  It seems to me that the be_pid field provided in a
PGnotify message ought not always be your own backend's PID --- what use
is that?  What it ought to be is the PID of the *signaling* backend, or
at least of one of the signaling backends other than your own.  I have
a bunch of applications that all read and write the same tables and let
each other know about updates by NOTIFY messages.  When one of these
apps gets a NOTIFY, it has to re-read the table *even if the notify was
the one it just issued*, and the updates in question are the ones it
just wrote out.  This is a waste of cycles, but right now there is no
choice because you cannot tell whether the NOTIFY you get is due
(solely) to your own NOTIFY or (perhaps in part) to someone else's.

What I'd like to do is to have cross-backend notifies write the
originating backend's PID into the "notification" field, rather than
just a "1".  The receiving backend can then send that value to its
frontend, as opposed to sending its own PID as it will do in the
self-notify case.  In this way the frontend can unambiguously
distinguish its own NOTIFY bouncing back from a NOTIFY from another
client.  (If several other clients send NOTIFY at about the same time,
you may get only one message with only one of their PIDs, since there
is only room for one originating PID in pg_listen.  But it will be kept
distinct from any message resulting from a NOTIFY of your own.)  Note
that we already made the backend's own PID available to the frontend as
part of the CANCEL-related changes, so that part of the info is there
for free.

A more radical solution would be to suppress self-notifies altogether,
but that is a change in semantics that would likely break some existing
apps.  I will be satisfied if the frontend can tell the difference
between a self-notify and an incoming notify.


Any objections?  If there's something crucial I've missed in all this,
better let me know ASAP!

I know that it's a tad late in the beta cycle to be making such significant
changes, but my applications really need NOTIFY to work and work reliably.
We have to have this *now*.  I hope y'all will bear with me.

            regards, tom lane

Re: [HACKERS] Think we need major revisions in async.c...

From
Bruce Momjian
Date:
> What I'd like to do is to have cross-backend notifies write the
> originating backend's PID into the "notification" field, rather than
> just a "1".  The receiving backend can then send that value to its
> frontend, as opposed to sending its own PID as it will do in the
> self-notify case.  In this way the frontend can unambiguously
> distinguish its own NOTIFY bouncing back from a NOTIFY from another
> client.  (If several other clients send NOTIFY at about the same time,
> you may get only one message with only one of their PIDs, since there
> is only room for one originating PID in pg_listen.  But it will be kept
> distinct from any message resulting from a NOTIFY of your own.)  Note
> that we already made the backend's own PID available to the frontend as
> part of the CANCEL-related changes, so that part of the info is there
> for free.
>
> A more radical solution would be to suppress self-notifies altogether,
> but that is a change in semantics that would likely break some existing
> apps.  I will be satisfied if the frontend can tell the difference
> between a self-notify and an incoming notify.
>
>
> Any objections?  If there's something crucial I've missed in all this,
> better let me know ASAP!
>
> I know that it's a tad late in the beta cycle to be making such significant
> changes, but my applications really need NOTIFY to work and work reliably.
> We have to have this *now*.  I hope y'all will bear with me.

Sounds good.  You have worked with the Notify code more than anyone
else, so go ahead.  You are fixing a bug, and that is fine at this
stage.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Think we need major revisions in async.c...

From
Massimo Dal Zotto
Date:
>
> Well, I've spent the last several hours studying how NOTIFY works, and
> I think it's a mess :-(.  We need some significant revisions in there.

I agreee. I tried to improve it but it's really a mess.

> I have not been able to trace an exact flow of control that would
> yield the NOTIFY-message-interspersed-in-data result that I saw this
> afternoon.  However, I am pretty certain that I know the basic cause
> of the problem.  The risky aspect of NOTIFY is that very substantial
> amounts of processing (an entire transaction, in fact) are launched from
> a signal handler routine.  As a rule of thumb, doing more than setting a
> flag variable in a signal handler is a big NO NO, because you don't have
> any idea what bit of code you've interrupted or what you can safely
> call.  (I have a strong suspicion, in fact, that the behavior I saw
> arose from the signal interrupt occuring during the backend's fflush()

I suspected this could happen but don't really know how signals are handled.

> of data destined for the frontend at the end of a query cycle.  I dunno
> how re-entrant the stdio library is on my machine, but it seems quite
> possible that it behaves undesirably if one attempts to queue more
> output data from a routine that interrupts fflush()...)
>
> The Postgres code tries to protect itself against this sort of problem
> by having the SIGUSR2 handler not do anything except set a flag if the
> current transaction state says that a transaction is in progress.  This
> is almost good enough, but evidently not quite.  The first issue is that
> some query initialization and finalization code in PostgresMain happens
> outside the StartTransactionCommand...CommitTransactionCommand zone of
> safety.  (In particular, the final fflush of data bound for the
> frontend.)  The second issue is that async.c itself sometimes starts and
> ends transactions, by calling xact.c, and yet xact.c also calls async.c
> as part of transaction commit.  At least two levels of recursion are
> possible in there.  I have not convinced myself whether the bug I

I believe that the transaction code can avoid this type of recursion.

> encountered is due to some subtly incorrect ordering of events in that
> interaction, or whether it's just due to non-reentrancy in stdio.
> But I no longer care.  This code is demonstrably buggy, and it is too
> complex to understand or maintain anyway.  I propose to rewrite it.
>
> I propose that:
>
> 1. The link between inbound-notify processing and transaction
> start/commit should be removed.  Having inbound-notify execution called
> from transaction commit is too complicated because that notify execution
> itself may have to induce a transaction start/commit.  Instead, inbound
> notify execution ought to be triggered from the PostgresMain main loop.

The reason why it is done at commit is that it is the best place where we
can know if the transaction completed or not. If we move the notify code
in the main loop we must also keep some flag which tell us if and when the
notify can be done.

> This has the additional advantage that we can have more control over
> just which parts of the main loop allow or disallow notify execution
> directly from the signal handler.  The main loop should now contain
> something like:
>
>     for (;;) {
>
>         ReadyForQuery();    // contains fflush() call
>
>         AllowNotifyInterrupt();
>
>         ReadCommand();
>
>         DisableNotifyInterrupt();
>
>         execute command, including transaction start/commit
>     }
>
> so that the SIGUSR2 interrupt is allowed to do something serious
> only in the *very* tiny window where we are just waiting for or
> reading the next frontend query.  (Since the interrupt has no
> reason to touch the state of stdin, this should be safe enough.)
> An interrupt occuring any other time will just set a flag, which
> will be read by AllowNotifyInterrupt --- if an interrupt has occured
> since the last DisableNotifyInterrupt, then AllowNotifyInterrupt will
> carry out the necessary processing.  Thus, all inbound-notify execution
> effectively happens at the outer loop --- either while waiting for
> frontend input, or just before doing so.
>
>
> 2. The innards of NOTIFY execution are unnecessarily complicated and
> inefficient, quite aside from the above problem of controlling interrupt
> safety.  The NOTIFY statement itself (routine Async_Notify) makes one
> pass over the pg_listener relation, writing a 1 into the notification
> field of each tuple with a matching relname.  Then at commit time, we
> have to make another pass over the pg_listener relation to issue signals
> (and/or send a message to our own front end, if we're NOTIFYing
> something we also LISTEN on, which is pretty common in my applications
> anyway).  This is unnecessary processing; worse, if the NOTIFY is done
> early in a long transaction block, we end up holding a write lock on
> pg_listen for a long time.  Since pg_listen is a global resource, this
> can effectively bring all the rest of the backends to a halt!

This is the standard behaviour of the notify code. If you set the flag
notifyunlock in .../data/pg_options the write lock is released immediately
after updating the table, but I suspect this is the reason of the
duplicate oids. For this reason I left the flag disabled by default, and
maybe this explains why you don't get duplicate oids in 6.4, but I have
enabled it in a production environment without problems at all.

> The right way to do this is for NOTIFY to simply add the target relation
> name to a list of pending notifies (as indeed it does already).  Then
> when we are ready to commit, we scan pg_listen *once*, performing tuple
> updates as needed and also sending signals and/or frontend messages.
> (Note this is for an *outbound* notify, which we do want to have happen
> during transaction commit.  The real problem with the current async.c is
> that it tries to combine outbound and inbound notify code.  Keeping them

The real problem is that inbound notifiy can happen at any moment from the
signal handler.

> separate will be better.)  This approach takes less processing, and it
> also means that we need not do anything so risky as
> RelationUnsetLockForWrite() to gain performance.  The pg_listen lock
> will be held for only a very short time before we release it as a normal
> part of commit.  Moreover, a self-notify need never modify the pg_listen
> table at all, saving cycles and opening the door for another feature
> (see below).

But in general you must still get the write lock to prevent deadlocks.

> If the transaction is aborted rather than committed, we need only
> discard the list of pending notify names.  pg_listen is never touched
> in that case.

This sounds reasonable.

> 3. While I am at it, I would like to add a feature that I've been
> wanting for a while.  It seems to me that the be_pid field provided in a
> PGnotify message ought not always be your own backend's PID --- what use
> is that?  What it ought to be is the PID of the *signaling* backend, or
> at least of one of the signaling backends other than your own.  I have
> a bunch of applications that all read and write the same tables and let
> each other know about updates by NOTIFY messages.  When one of these
> apps gets a NOTIFY, it has to re-read the table *even if the notify was
> the one it just issued*, and the updates in question are the ones it
> just wrote out.  This is a waste of cycles, but right now there is no
> choice because you cannot tell whether the NOTIFY you get is due
> (solely) to your own NOTIFY or (perhaps in part) to someone else's.

The problem is that notify has the same semantic of unix signals. You can
have more notifies for the same table from many backends but you only know
that at least one has happened.
What you propose should be some sort of message queuing between backends
in which each message is notified separately to the other backend In this
case it would also be useful to pass some arbitrary text to the listening
backend.

> What I'd like to do is to have cross-backend notifies write the
> originating backend's PID into the "notification" field, rather than
> just a "1".  The receiving backend can then send that value to its
> frontend, as opposed to sending its own PID as it will do in the
> self-notify case.  In this way the frontend can unambiguously
> distinguish its own NOTIFY bouncing back from a NOTIFY from another
> client.  (If several other clients send NOTIFY at about the same time,
> you may get only one message with only one of their PIDs, since there
> is only room for one originating PID in pg_listen.  But it will be kept
> distinct from any message resulting from a NOTIFY of your own.)  Note
> that we already made the backend's own PID available to the frontend as
> part of the CANCEL-related changes, so that part of the info is there
> for free.
>
> A more radical solution would be to suppress self-notifies altogether,
> but that is a change in semantics that would likely break some existing
> apps.  I will be satisfied if the frontend can tell the difference
> between a self-notify and an incoming notify.

I wouldn't change this.

> Any objections?  If there's something crucial I've missed in all this,
> better let me know ASAP!
>
> I know that it's a tad late in the beta cycle to be making such significant
> changes, but my applications really need NOTIFY to work and work reliably.
> We have to have this *now*.  I hope y'all will bear with me.
>
>             regards, tom lane
>

Massimo Dal Zotto

+----------------------------------------------------------------------+
|  Massimo Dal Zotto                email:  dz@cs.unitn.it             |
|  Via Marconi, 141                 phone:  ++39-461-534251            |
|  38057 Pergine Valsugana (TN)     www:  http://www.cs.unitn.it/~dz/  |
|  Italy                            pgp:  finger dz@tango.cs.unitn.it  |
+----------------------------------------------------------------------+

Re: [HACKERS] Think we need major revisions in async.c...

From
Tom Lane
Date:
Massimo Dal Zotto <dz@cs.unitn.it> writes:
>> 1. The link between inbound-notify processing and transaction
>> start/commit should be removed.  Having inbound-notify execution called
>> from transaction commit is too complicated because that notify execution
>> itself may have to induce a transaction start/commit.  Instead, inbound
>> notify execution ought to be triggered from the PostgresMain main loop.

> The reason why it is done at commit is that it is the best place where we
> can know if the transaction completed or not. If we move the notify code
> in the main loop we must also keep some flag which tell us if and when the
> notify can be done.

No, you're confusing outbound and inbound notify.  Outbound notify ---
ie, actually doing something about a NOTIFY command issued by the user
--- should indeed be done from transaction commit.  If the transaction
is aborted, we want the NOTIFY to never take effect, just like all the
other side-effects of the transaction.  What I proposed for that was
that we could simplify matters by doing both the signal-sending and
pg_listener-table-updating work at the same time during transaction
commit, rather than changing the table and issuing the signals in
separate passes.

Inbound notify, ie, receiving messages from other backends, is a
different story.  There is no question of not accepting the signals;
they've already been committed by some other backend.  Now for
implementation reasons we need to hold off responding to an inbound
notify until we are outside of any transaction.  (Otherwise, our
attempt to mark the notify accepted by changing pg_listener could be
undone by a later transaction abort; not what we want.)  But that
doesn't mean the response processing has to be called from xact.c.
Better to do it at the main loop, with a check to see if we are inside
a transaction block or not.  Otherwise we have to depend on stdio
to be re-entrant.  Also, xact.c is already tricky enough without having
to worry about the exact order of operations in straightline code.
I spent some time on Saturday trying to see if the variables the signal
handler was looking at were being changed before the transaction was
quite complete --- and I'm still not sure whether it was really safe,
given the potential for mutual recursion between async.c and xact.c
in the old code.  Writing code that interacts correctly with an ISR is
tricky stuff, so you really want to do it in a place that doesn't have
much of anything else to worry about.

> This is the standard behaviour of the notify code. If you set the flag
> notifyunlock in .../data/pg_options the write lock is released immediately
> after updating the table, but I suspect this is the reason of the
> duplicate oids. For this reason I left the flag disabled by default, and
> maybe this explains why you don't get duplicate oids in 6.4, but I have
> enabled it in a production environment without problems at all.

Neither NotifyUnlock nor NotifyHack are necessary in the rewritten
code...

>> This is a waste of cycles, but right now there is no
>> choice because you cannot tell whether the NOTIFY you get is due
>> (solely) to your own NOTIFY or (perhaps in part) to someone else's.

> What you propose should be some sort of message queuing between backends
> in which each message is notified separately to the other backend In this
> case it would also be useful to pass some arbitrary text to the listening
> backend.

Not really.  For the applications I'm interested in, there are only two
interesting cases: a NOTIFY from some other client vs. one of your own
NOTIFYs bouncing back at you.  With the code as I now have it, you can
differentiate those cases.  You're right that a more general
message-passing facility might be useful.  On the other hand, one
can be built by users using an ordinary table and NOTIFY to wake up
sleeping respondents.  So I'm not sure we need to put more stuff in
the backend for that.

I have been testing this new code and it seems to work fine, except for
the question of interlocks against "select * from pg_listener" --- and
if that's a bug, it's one that was there before anyway.  So I'm ready
to check it in.  (Marc, are you there?)

            regards, tom lane