Think we need major revisions in async.c... - Mailing list pgsql-hackers

From Tom Lane
Subject Think we need major revisions in async.c...
Date
Msg-id 2520.906858313@sss.pgh.pa.us
Whole thread Raw
Responses Re: [HACKERS] Think we need major revisions in async.c...
Re: [HACKERS] Think we need major revisions in async.c...
List pgsql-hackers
Well, I've spent the last several hours studying how NOTIFY works, and
I think it's a mess :-(.  We need some significant revisions in there.

I have not been able to trace an exact flow of control that would
yield the NOTIFY-message-interspersed-in-data result that I saw this
afternoon.  However, I am pretty certain that I know the basic cause
of the problem.  The risky aspect of NOTIFY is that very substantial
amounts of processing (an entire transaction, in fact) are launched from
a signal handler routine.  As a rule of thumb, doing more than setting a
flag variable in a signal handler is a big NO NO, because you don't have
any idea what bit of code you've interrupted or what you can safely
call.  (I have a strong suspicion, in fact, that the behavior I saw
arose from the signal interrupt occuring during the backend's fflush()
of data destined for the frontend at the end of a query cycle.  I dunno
how re-entrant the stdio library is on my machine, but it seems quite
possible that it behaves undesirably if one attempts to queue more
output data from a routine that interrupts fflush()...)

The Postgres code tries to protect itself against this sort of problem
by having the SIGUSR2 handler not do anything except set a flag if the
current transaction state says that a transaction is in progress.  This
is almost good enough, but evidently not quite.  The first issue is that
some query initialization and finalization code in PostgresMain happens
outside the StartTransactionCommand...CommitTransactionCommand zone of
safety.  (In particular, the final fflush of data bound for the
frontend.)  The second issue is that async.c itself sometimes starts and
ends transactions, by calling xact.c, and yet xact.c also calls async.c
as part of transaction commit.  At least two levels of recursion are
possible in there.  I have not convinced myself whether the bug I
encountered is due to some subtly incorrect ordering of events in that
interaction, or whether it's just due to non-reentrancy in stdio.
But I no longer care.  This code is demonstrably buggy, and it is too
complex to understand or maintain anyway.  I propose to rewrite it.

I propose that:

1. The link between inbound-notify processing and transaction
start/commit should be removed.  Having inbound-notify execution called
from transaction commit is too complicated because that notify execution
itself may have to induce a transaction start/commit.  Instead, inbound
notify execution ought to be triggered from the PostgresMain main loop.
This has the additional advantage that we can have more control over
just which parts of the main loop allow or disallow notify execution
directly from the signal handler.  The main loop should now contain
something like:

    for (;;) {

        ReadyForQuery();    // contains fflush() call

        AllowNotifyInterrupt();

        ReadCommand();

        DisableNotifyInterrupt();

        execute command, including transaction start/commit
    }

so that the SIGUSR2 interrupt is allowed to do something serious
only in the *very* tiny window where we are just waiting for or
reading the next frontend query.  (Since the interrupt has no
reason to touch the state of stdin, this should be safe enough.)
An interrupt occuring any other time will just set a flag, which
will be read by AllowNotifyInterrupt --- if an interrupt has occured
since the last DisableNotifyInterrupt, then AllowNotifyInterrupt will
carry out the necessary processing.  Thus, all inbound-notify execution
effectively happens at the outer loop --- either while waiting for
frontend input, or just before doing so.


2. The innards of NOTIFY execution are unnecessarily complicated and
inefficient, quite aside from the above problem of controlling interrupt
safety.  The NOTIFY statement itself (routine Async_Notify) makes one
pass over the pg_listener relation, writing a 1 into the notification
field of each tuple with a matching relname.  Then at commit time, we
have to make another pass over the pg_listener relation to issue signals
(and/or send a message to our own front end, if we're NOTIFYing
something we also LISTEN on, which is pretty common in my applications
anyway).  This is unnecessary processing; worse, if the NOTIFY is done
early in a long transaction block, we end up holding a write lock on
pg_listen for a long time.  Since pg_listen is a global resource, this
can effectively bring all the rest of the backends to a halt!

The right way to do this is for NOTIFY to simply add the target relation
name to a list of pending notifies (as indeed it does already).  Then
when we are ready to commit, we scan pg_listen *once*, performing tuple
updates as needed and also sending signals and/or frontend messages.
(Note this is for an *outbound* notify, which we do want to have happen
during transaction commit.  The real problem with the current async.c is
that it tries to combine outbound and inbound notify code.  Keeping them
separate will be better.)  This approach takes less processing, and it
also means that we need not do anything so risky as
RelationUnsetLockForWrite() to gain performance.  The pg_listen lock
will be held for only a very short time before we release it as a normal
part of commit.  Moreover, a self-notify need never modify the pg_listen
table at all, saving cycles and opening the door for another feature
(see below).

If the transaction is aborted rather than committed, we need only
discard the list of pending notify names.  pg_listen is never touched
in that case.


3. While I am at it, I would like to add a feature that I've been
wanting for a while.  It seems to me that the be_pid field provided in a
PGnotify message ought not always be your own backend's PID --- what use
is that?  What it ought to be is the PID of the *signaling* backend, or
at least of one of the signaling backends other than your own.  I have
a bunch of applications that all read and write the same tables and let
each other know about updates by NOTIFY messages.  When one of these
apps gets a NOTIFY, it has to re-read the table *even if the notify was
the one it just issued*, and the updates in question are the ones it
just wrote out.  This is a waste of cycles, but right now there is no
choice because you cannot tell whether the NOTIFY you get is due
(solely) to your own NOTIFY or (perhaps in part) to someone else's.

What I'd like to do is to have cross-backend notifies write the
originating backend's PID into the "notification" field, rather than
just a "1".  The receiving backend can then send that value to its
frontend, as opposed to sending its own PID as it will do in the
self-notify case.  In this way the frontend can unambiguously
distinguish its own NOTIFY bouncing back from a NOTIFY from another
client.  (If several other clients send NOTIFY at about the same time,
you may get only one message with only one of their PIDs, since there
is only room for one originating PID in pg_listen.  But it will be kept
distinct from any message resulting from a NOTIFY of your own.)  Note
that we already made the backend's own PID available to the frontend as
part of the CANCEL-related changes, so that part of the info is there
for free.

A more radical solution would be to suppress self-notifies altogether,
but that is a change in semantics that would likely break some existing
apps.  I will be satisfied if the frontend can tell the difference
between a self-notify and an incoming notify.


Any objections?  If there's something crucial I've missed in all this,
better let me know ASAP!

I know that it's a tad late in the beta cycle to be making such significant
changes, but my applications really need NOTIFY to work and work reliably.
We have to have this *now*.  I hope y'all will bear with me.

            regards, tom lane

pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] NOTIFY interlock broken (was Yipes, I'm getting bit by duplicate tuples)
Next
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] Think we need major revisions in async.c...