Thread: LISTEN/NOTIFY race condition?

LISTEN/NOTIFY race condition?

From
Laurent Birtz
Date:
Hello,

I am wondering if I have encountered a race condition in Postgres that
sometimes cause a notification not to be sent. This post is rather long
since the situation is rather complex and I tried to provide as much
details as possible.


Please note:

I have modified the file src/backend/commands/async.c to make the logging
more verbose. I have not otherwise modified the code. I am using
Postgres 8.3 on Debian/Linux. The sources of Postgres I am using were
taken from Debian's packages (unstable repos). Debian patches were
applied.


Organization of my setup:

I am running a multithreaded daemon that uses Postgres. This daemon has
two threads of interest: the command thread and the event thread. Each
thread uses a dedicated connection to the Postgres database, i.e. there
are two independent connections to the database. Both connections are
asynchronous, i.e. I have called PQsetnonblocking() on both connections
to make them non-blocking.

The command thread inserts data (which I call "events") in a table called
the "event log". The event thread polls the event log table from time to
time to retrieve and process the new events. The command thread never
read data from the event log and the event thread  never write data to
the event log.

I use the LISTEN/NOTIFY mechanism to wake up the event thread when a
new event is written in the event log table. An AFTER-INSERT trigger is
created on the event log table so that any insertion in that table sends
a notification to the connections listening on that table, i.e. the event
thread.

The processing done by the event thread is as follow:

1) Connect to the database.
2) Do a LISTEN operation on the event log.
3) Do a SELECT operation on the event log to obtain all the events
   inserted so far.
4) Enter the "wait-for-notification" loop. When a notification is
   received, redo a SELECT operation to obtain the newly posted events.

The problem I am encountering is that sometimes, when an event is posted
between the steps 2) and 3) above, the following situation occurs:

a) No notification about this event is ever sent to the event thread.
b) The SELECT operation of step 3) does NOT retrieve the posted event.

Hence, the event thread does not retrieve the posted event until a
notification is received because another event was posted. The problem
occurs about 1 time of 100. All attempts to introduce sleep() calls
in the code of the daemon to make the problem more reproductible prevent
the problem from happening (heisenbug).

The following trace presents the steps being done until the problem
is encountered. The output sent to the daemon by Postgres is intermixed
with the output of the daemon itself. I provided comments to describe
what is going on. Clearly, what is shown here is the OBSERVED sequence
of events; the real chronological order can only be guessed at.


// The command thread begins a transaction.
Command thread: executing |start transaction|.
Command thread: executing |create sequence kws_961_event_id_seq|.

// The command thread creates the event table.
Command thread: executing
   |create table kws_961_event_log (
    id int primary key default nextval('kws_961_event_id_seq'),
    type int,
    event bytea)|.

NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
         "kws_961_event_log_pkey" for table "kws_961_event_log"

Command thread: executing
   |create or replace function notify_log_961() returns trigger as '
    begin if TG_OP = ''INSERT'' then notify kws_961_event_log;
    end if; return NEW; end' language plpgsql|.

Command thread: executing
   |create trigger notify_log_961 after insert on kws_961_event_log
    execute procedure notify_log_961()|.

// The command thread posts an event in the event log.
Command thread: executing |INSERT INTO kws_961_event_log (bla bla bla...

DEBUG:  Async_Notify(kws_961_event_log)
CONTEXT:  SQL statement "notify kws_961_event_log"
PL/pgSQL function "notify_log_961" line 1 at SQL statement

// The command thread commits the transaction.
Command thread: executing |commit|.

DEBUG:  AtCommit_Notify
DEBUG:  Pending notification for relation kws_961_event_log. // My msg.
DEBUG:  AtCommit_Notify: done


// NOT SHOWN: the command thread posts another event in the event log.


// The listen thread now wants to listen to the event log. At this point
// there are two events in the event log. NOTE: the event thread SENDS
// the query to Postgres, but does not RECEIVE the result yet (see
// below).
Event thread: executing |listen kws_961_event_log|.

DEBUG:  Async_Listen(kws_961_event_log,17748)
DEBUG:  Async_Listen(kws_961_event_log,17748): got exclusive lock on
        listener table
DEBUG:  Async_Listen(kws_961_event_log,17748): inserting new tuple
DEBUG:  Async_Listen(kws_961_event_log,17748): about to release exclusive
        lock
DEBUG:  Async_Listen(kws_961_event_log,17748): released exclusive lock


// Meanwhile, the command thread posts the third event, in a transaction.
Command thread: executing |start transaction|.
Command thread: executing |INSERT INTO kws_961_event_log (bla bla bla

DEBUG:  Async_Notify(kws_961_event_log)
CONTEXT:  SQL statement "notify kws_961_event_log"
PL/pgSQL function "notify_log_961" line 1 at SQL statement

// The command thread commits the transaction.
Command thread: executing |commit|.
DEBUG:  AtCommit_Notify

// Note: this message occurs because I printed the content of
// 'static List *pendingNotifies = NIL;' in async.c.
DEBUG:  Pending notification for relation kws_961_event_log.

// Notice that NO notification is actually delivered.
DEBUG:  AtCommit_Notify: done


// The event thread receives the result of the LISTEN operation.
Event thread: result for listen to workspace 961.

// The event thread polls the event log.
Event thread: |select id, type, event from kws_961_event_log
               where id > 0|.

// The event thread receives the poll results. Notice that only two
// events are actually retrieved.
Event thread: fetched 2 events for workspace 961.
Event thread: got event ID 1 type 402784512.
Event thread: got event ID 2 type 402785024.

// No subsequent notification is received, even though PQconsumeInput()
// and PQnotifies() are called subsequently and repeatedly. A
// notification is only received when another event is posted.


 From the observed sequence of events, it would appear that the event
thread inserts a tuple in the pg_listener table BEFORE the command thread
actually commits the transaction. However, when this transaction commits,
Postgres does not actually find the event thread's tuple in this table.
Consequently, one has to assume that the tuple in the pg_listener table
is inserted AFTER the AtCommit_Notify() function has executed.

But, when the event thread later polls the event log, it does not
actually retrieve the third event. This suggests that the transaction has
not yet been commited.

I have three hypotheseses at this point:

1) The pg_listener table is desynchronized.

2) The function AtCommit_Notify() does not hold the lock on the
   pg_listener table until the transaction is fully committed, so the
   event thread has the time to listen and poll the event log in that
   small window of time.

3) I screwed up somewhere (most likely hypothesis :-).


Can somebody help me figure out what is going on? That would be much
appreciated!

Thanks in advance,
Laurent Birtz

Re: LISTEN/NOTIFY race condition?

From
Tom Lane
Date:
Laurent Birtz <laurent.birtz@kryptiva.com> writes:
>  From the observed sequence of events, it would appear that the event
> thread inserts a tuple in the pg_listener table BEFORE the command thread
> actually commits the transaction. However, when this transaction commits,
> Postgres does not actually find the event thread's tuple in this table.

I'm wondering exactly when you commit the LISTEN?  The command thread
can't see the pg_listener tuple until it's committed ...

            regards, tom lane

Re: LISTEN/NOTIFY race condition?

From
Laurent Birtz
Date:
Tom Lane wrote:
> Laurent Birtz <laurent.birtz@kryptiva.com> writes:
>
>>  From the observed sequence of events, it would appear that the event
>> thread inserts a tuple in the pg_listener table BEFORE the command thread
>> actually commits the transaction. However, when this transaction commits,
>> Postgres does not actually find the event thread's tuple in this table.
>>
>
> I'm wondering exactly when you commit the LISTEN?  The command thread
> can't see the pg_listener tuple until it's committed ...
>
I'm executing the LISTEN statement outside a transaction block.
I presume Postgres implicitly adds the BEGIN and COMMIT
statements in that case?

Just to be clear, the SELECT statement that retrieve new events is
also executed outside a transaction block.

Thanks for the help,
Laurent Birtz

Re: LISTEN/NOTIFY race condition?

From
Laurent Birtz
Date:
Hello,

I've figured out the LISTEN / NOTIFY race condition I had previously
encountered. It is not trivial, so I'll try to give as much details as
possible to describe what is going on.


Here is a description of the two Postgres functions involved in the race
condition:

void CommitTransaction()
{
  ...

  /* NOTIFY commit must come before lower-level cleanup */
  AtCommit_Notify();

  ...

  /*
   * Here is where we really truly commit.
   */
  latestXid = RecordTransactionCommit();

  ...

  /*
   * This is all post-commit cleanup.  Note that if an error is raised here,
   * it's too late to abort the transaction.  This should be just
   * noncritical resource releasing.
   *
   * The ordering of operations is not entirely random.  The idea is:
   * release resources visible to other backends (eg, files, buffer pins);
   * then release locks; then release backend-local resources. We want to
   * release locks at the point where any backend waiting for us will see
   * our transaction as being fully cleaned up.
   *
   * Resources that can be associated with individual queries are handled by
   * the ResourceOwner mechanism.  The other calls here are for backend-wide
   * state.
   */
  ResourceOwnerRelease(TopTransactionResourceOwner,
                       RESOURCE_RELEASE_LOCKS,
                       true, true);
  ...
}

void Async_Listen(const char *relname)
{
  ...

  lRel = heap_open(ListenerRelationId, ExclusiveLock);

  ...

  /*
   * OK to insert a new tuple
   */

  simple_heap_insert(lRel, tuple);

  ...

  heap_close(lRel, ExclusiveLock);

  ...
}

In summary, CommitTransaction() notifies FIRST and commits AFTER.
Async_Listen() releases its lock BEFORE the end of its transaction.


I've written a program that triggers the bug quickly. Here is what this
program do:

1) The program deletes every row of the table 'the_log'. This table
   contains an AFTER-INSERT trigger that notifies processes listening on
   it.

2) The program starts a command thread and an event thread concurrently.

3) The command thread executes the following statements:
   a) BEGIN TRANSACTION
   b) INSERT INTO the_log ...
   c) COMMIT

4) The event thread executes 'LISTEN the_log'.

5) The event thread executes 'SELECT * FROM the_log'.

6) If the event thread found the event, we clean up and go back to 1).

7) The event thread waits for a notification. If the notification is
   received, we clean up and go back to 1).

8) We reproduced the bug.


I've instrumented the Postgres source to show what is going on in the
functions of interest. I print both the executing process PID and the
current time in microseconds. My machine has only one CPU. Hence, I'm
pretty sure the following trace accurately reflects what is happening
when my program is executing.

Note:
      PID 29295 is the backend for the command thread.
      PID 29296 is the backend for the event thread.
      Blank lines correspond to context switches.

Command thread: executing |start transaction|.

Event thread: executing |LISTEN the_log|.

// The command backend is handling 'start transaction'.
DEBUG:  [PID 29295] [Time 497310]:
    BeginTransactionBlock() called for BEGIN

Command thread: executing |insert into the_log...|.

// The event backend is doing the listen.
DEBUG:  [PID 29296] [Time 499744]: StartTransaction() called
DEBUG:  Async_Listen(the_log,29296)
DEBUG:  [PID 29296] [Time 501046]: CommitTransaction() called
DEBUG:  [PID 29296] [Time 501163]: RecordTransactionCommit() called

// The command backend is handling 'insert'.
DEBUG:  Async_Notify(the_log)
DEBUG:  [PID 29295] [Time 502624]:
    CommitTransactionCommand: incrementing command counter

Command thread: executing |commit|.

// The command backend is handling 'commit'.
DEBUG:  [PID 29295] [Time 503192]: CommitTransaction() called
DEBUG:  [PID 29295] AtCommit_Notify
DEBUG:  [PID 29295] AtCommit_Notify: done
DEBUG:  [PID 29295] [Time 507981]: RecordTransactionCommit() called

// The event backend is still handling 'listen'.
DEBUG:  [PID 29296] [Time 508319]:
    CommitTransaction: about to release with RESOURCE_RELEASE_LOCKS.

Event thread: executing |select * from the_log|.

// The event backend is handling 'select'.
DEBUG:  [PID 29296] [Time 508754]: StartTransaction() called
DEBUG:  [PID 29296] [Time 510082]: CommitTransaction() called
DEBUG:  [PID 29296] [Time 510207]: RecordTransactionCommit() called
DEBUG:  [PID 29296] [Time 510326]:
    CommitTransaction: about to release with RESOURCE_RELEASE_LOCKS.

// The command thread is still handling 'commit'.
DEBUG:  [PID 29295] [Time 510690]:
    CommitTransaction: about to release with RESOURCE_RELEASE_LOCKS.


Analysis:

The command backend begins a transaction, then the event backend begins a
transaction for LISTEN. The event backend updates the pg_listener table
in mutual exclusion, but it releases the mutex before the transaction is
commited. The command thread does its INSERT then starts its COMMIT. The
command backend scans the pg_listener table and doesn't find anyone to
notify. Before the command backend has the time to actually commit, the
event backend commits and the listen thread has the time to execute the
SELECT statement. Since the command thread didn't commit yet, the SELECT
returns nothing. Finally, the command thread commits, without notifying
anyone.

Essentially, the event thread won't receive a notification for the events
commited when it executes the LISTEN statement, and it won't necessarily
see them either if it executes a SELECT statement subsequently. I believe
this greatly limits the usefulness of the LISTEN / NOTIFY mechanism.


Solutions:

There are two solutions I developed:

1) Don't use an AFTER-INSERT trigger. Execute NOTIFY in its own
   top-level transaction.

  Cons:
  - Bad for performance.
  - Annoying kludge.
  - It's flawed. If the command process crashes before it
    commits the NOTIFY, the event won't be noticed.

2) Fix the race condition in Postgres.

   In Async_Listen(): change
   'heap_close(lRel, ExclusiveLock);' for 'heap_close(lRel, NoLock);'.

   I've tested this solution. It seems to work and I believe this is the
   best solution.


I would appreciate receiving feedback on this issue.

Thanks for your time,
Laurent Birtz

Re: LISTEN/NOTIFY race condition?

From
Tom Lane
Date:
Laurent Birtz <laurent.birtz@kryptiva.com> writes:
> The command backend begins a transaction, then the event backend begins a
> transaction for LISTEN. The event backend updates the pg_listener table
> in mutual exclusion, but it releases the mutex before the transaction is
> commited. The command thread does its INSERT then starts its COMMIT. The
> command backend scans the pg_listener table and doesn't find anyone to
> notify. Before the command backend has the time to actually commit, the
> event backend commits and the listen thread has the time to execute the
> SELECT statement. Since the command thread didn't commit yet, the SELECT
> returns nothing. Finally, the command thread commits, without notifying
> anyone.

Hmm ... yup, that's a race condition all right, and of sufficiently low
probability that it's not surprising it's gone undetected for so long.

>    In Async_Listen(): change
>    'heap_close(lRel, ExclusiveLock);' for 'heap_close(lRel, NoLock);'.

This solution is pretty ugly, though, because we allow people to
execute LISTEN/UNLISTEN in transaction blocks, which means that the
ExclusiveLock could be held for quite some time.  Not only is that bad
for performance but it poses significant risks of deadlocks.

What I am thinking is that we need to change LISTEN/UNLISTEN so that
they don't attempt to modify the pg_listener table until COMMIT time.
It would go about like this:

1. LISTEN and UNLISTEN would make entries in a transaction-lifespan
list, much as NOTIFY does, with suitable logic to cancel each others'
effects on the list as needed.  They don't touch pg_listener at all.

2. If we abort the transaction then the list of pending actions is just
thrown away.  (Hmm, we'd also need to account for subtransactions...)

3. If we commit, then AtCommit_Notify is responsible for applying any
pg_listener insertions and deletions indicated by the pending-actions
list, after it grabs ExclusiveLock and before it starts the notify loop.
(A CommandCounterIncrement between this step and the notify loop will
assure that LISTEN and NOTIFY in the same transaction result in a
self-notify just as before.)

Since AtCommit_Notify doesn't release the lock, the race condition is
fixed, and since it only happens immediately before commit, there is no
added risk of deadlock.

The only externally visible difference in behavior is that it's less
likely for failed LISTENing transactions to leave dead tuples in
pg_listener.  There is an *internally* visible difference, in that
a sequence like

    BEGIN;
    LISTEN foo;
    SELECT ... FROM pg_listener ...
    COMMIT;

will fail to see any tuple from the LISTEN in pg_listener, where
historically it did.  I suppose it's conceivable that some application
out there depends on this, but it doesn't seem very likely.  (Slony guys
want to speak up here?)

Comments?  Have I missed anything?

Looking at this sketch, my first reaction is that it's a lot of code
to write in support of an implementation we hope to throw away soon.
But my second reaction is that we're going to need most of that code
anyway in a pg_listener-less implementation, because the module will
have to emulate the current transactional behavior of LISTEN/UNLISTEN
without any table to store rows in.  So there should be something
salvageable from the effort.

Assuming that we implement this (I'm willing to write the code),
is it sane to back-patch such a non-trivial change?  It's a bit
scary but on the other hand we've back-patched larger changes when
we had to.  Leaving the race condition in place isn't appetizing,
and neither is introducing deadlock risks that weren't there before.

            regards, tom lane

Re: LISTEN/NOTIFY race condition?

From
Laurent Birtz
Date:
>>    In Async_Listen(): change
>>    'heap_close(lRel, ExclusiveLock);' for 'heap_close(lRel, NoLock);'.
>>
>
> This solution is pretty ugly, though, because we allow people to
> execute LISTEN/UNLISTEN in transaction blocks, which means that the
> ExclusiveLock could be held for quite some time.  Not only is that bad
> for performance but it poses significant risks of deadlocks.
>

True, I had not considered this.

> Comments?  Have I missed anything?
>

As far as I can tell it seems fine.


Thanks a lot for your input!
Laurent Birtz

Re: LISTEN/NOTIFY race condition?

From
Laurent Birtz
Date:
Tom Lane wrote:
> This patch will apply against HEAD and 8.3, but not cleanly against
> prior versions.  Since this code hasn't changed materially (except for
> additions of subtransaction and 2PC support) for a long time,
> back-patching should be straightforward, but I haven't actually done
> that yet.
>
> Comments?
>

I've reviewed the patch briefly and tested it against my version
of Postgres (8.3/Debian) and the test program I wrote to
reproduce the bug. I confirm the test program works fine.

Thanks for the quick fix!
Laurent Birtz