Thread: LISTEN/NOTIFY race condition?
Hello, I am wondering if I have encountered a race condition in Postgres that sometimes cause a notification not to be sent. This post is rather long since the situation is rather complex and I tried to provide as much details as possible. Please note: I have modified the file src/backend/commands/async.c to make the logging more verbose. I have not otherwise modified the code. I am using Postgres 8.3 on Debian/Linux. The sources of Postgres I am using were taken from Debian's packages (unstable repos). Debian patches were applied. Organization of my setup: I am running a multithreaded daemon that uses Postgres. This daemon has two threads of interest: the command thread and the event thread. Each thread uses a dedicated connection to the Postgres database, i.e. there are two independent connections to the database. Both connections are asynchronous, i.e. I have called PQsetnonblocking() on both connections to make them non-blocking. The command thread inserts data (which I call "events") in a table called the "event log". The event thread polls the event log table from time to time to retrieve and process the new events. The command thread never read data from the event log and the event thread never write data to the event log. I use the LISTEN/NOTIFY mechanism to wake up the event thread when a new event is written in the event log table. An AFTER-INSERT trigger is created on the event log table so that any insertion in that table sends a notification to the connections listening on that table, i.e. the event thread. The processing done by the event thread is as follow: 1) Connect to the database. 2) Do a LISTEN operation on the event log. 3) Do a SELECT operation on the event log to obtain all the events inserted so far. 4) Enter the "wait-for-notification" loop. When a notification is received, redo a SELECT operation to obtain the newly posted events. The problem I am encountering is that sometimes, when an event is posted between the steps 2) and 3) above, the following situation occurs: a) No notification about this event is ever sent to the event thread. b) The SELECT operation of step 3) does NOT retrieve the posted event. Hence, the event thread does not retrieve the posted event until a notification is received because another event was posted. The problem occurs about 1 time of 100. All attempts to introduce sleep() calls in the code of the daemon to make the problem more reproductible prevent the problem from happening (heisenbug). The following trace presents the steps being done until the problem is encountered. The output sent to the daemon by Postgres is intermixed with the output of the daemon itself. I provided comments to describe what is going on. Clearly, what is shown here is the OBSERVED sequence of events; the real chronological order can only be guessed at. // The command thread begins a transaction. Command thread: executing |start transaction|. Command thread: executing |create sequence kws_961_event_id_seq|. // The command thread creates the event table. Command thread: executing |create table kws_961_event_log ( id int primary key default nextval('kws_961_event_id_seq'), type int, event bytea)|. NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "kws_961_event_log_pkey" for table "kws_961_event_log" Command thread: executing |create or replace function notify_log_961() returns trigger as ' begin if TG_OP = ''INSERT'' then notify kws_961_event_log; end if; return NEW; end' language plpgsql|. Command thread: executing |create trigger notify_log_961 after insert on kws_961_event_log execute procedure notify_log_961()|. // The command thread posts an event in the event log. Command thread: executing |INSERT INTO kws_961_event_log (bla bla bla... DEBUG: Async_Notify(kws_961_event_log) CONTEXT: SQL statement "notify kws_961_event_log" PL/pgSQL function "notify_log_961" line 1 at SQL statement // The command thread commits the transaction. Command thread: executing |commit|. DEBUG: AtCommit_Notify DEBUG: Pending notification for relation kws_961_event_log. // My msg. DEBUG: AtCommit_Notify: done // NOT SHOWN: the command thread posts another event in the event log. // The listen thread now wants to listen to the event log. At this point // there are two events in the event log. NOTE: the event thread SENDS // the query to Postgres, but does not RECEIVE the result yet (see // below). Event thread: executing |listen kws_961_event_log|. DEBUG: Async_Listen(kws_961_event_log,17748) DEBUG: Async_Listen(kws_961_event_log,17748): got exclusive lock on listener table DEBUG: Async_Listen(kws_961_event_log,17748): inserting new tuple DEBUG: Async_Listen(kws_961_event_log,17748): about to release exclusive lock DEBUG: Async_Listen(kws_961_event_log,17748): released exclusive lock // Meanwhile, the command thread posts the third event, in a transaction. Command thread: executing |start transaction|. Command thread: executing |INSERT INTO kws_961_event_log (bla bla bla DEBUG: Async_Notify(kws_961_event_log) CONTEXT: SQL statement "notify kws_961_event_log" PL/pgSQL function "notify_log_961" line 1 at SQL statement // The command thread commits the transaction. Command thread: executing |commit|. DEBUG: AtCommit_Notify // Note: this message occurs because I printed the content of // 'static List *pendingNotifies = NIL;' in async.c. DEBUG: Pending notification for relation kws_961_event_log. // Notice that NO notification is actually delivered. DEBUG: AtCommit_Notify: done // The event thread receives the result of the LISTEN operation. Event thread: result for listen to workspace 961. // The event thread polls the event log. Event thread: |select id, type, event from kws_961_event_log where id > 0|. // The event thread receives the poll results. Notice that only two // events are actually retrieved. Event thread: fetched 2 events for workspace 961. Event thread: got event ID 1 type 402784512. Event thread: got event ID 2 type 402785024. // No subsequent notification is received, even though PQconsumeInput() // and PQnotifies() are called subsequently and repeatedly. A // notification is only received when another event is posted. From the observed sequence of events, it would appear that the event thread inserts a tuple in the pg_listener table BEFORE the command thread actually commits the transaction. However, when this transaction commits, Postgres does not actually find the event thread's tuple in this table. Consequently, one has to assume that the tuple in the pg_listener table is inserted AFTER the AtCommit_Notify() function has executed. But, when the event thread later polls the event log, it does not actually retrieve the third event. This suggests that the transaction has not yet been commited. I have three hypotheseses at this point: 1) The pg_listener table is desynchronized. 2) The function AtCommit_Notify() does not hold the lock on the pg_listener table until the transaction is fully committed, so the event thread has the time to listen and poll the event log in that small window of time. 3) I screwed up somewhere (most likely hypothesis :-). Can somebody help me figure out what is going on? That would be much appreciated! Thanks in advance, Laurent Birtz
Laurent Birtz <laurent.birtz@kryptiva.com> writes: > From the observed sequence of events, it would appear that the event > thread inserts a tuple in the pg_listener table BEFORE the command thread > actually commits the transaction. However, when this transaction commits, > Postgres does not actually find the event thread's tuple in this table. I'm wondering exactly when you commit the LISTEN? The command thread can't see the pg_listener tuple until it's committed ... regards, tom lane
Tom Lane wrote: > Laurent Birtz <laurent.birtz@kryptiva.com> writes: > >> From the observed sequence of events, it would appear that the event >> thread inserts a tuple in the pg_listener table BEFORE the command thread >> actually commits the transaction. However, when this transaction commits, >> Postgres does not actually find the event thread's tuple in this table. >> > > I'm wondering exactly when you commit the LISTEN? The command thread > can't see the pg_listener tuple until it's committed ... > I'm executing the LISTEN statement outside a transaction block. I presume Postgres implicitly adds the BEGIN and COMMIT statements in that case? Just to be clear, the SELECT statement that retrieve new events is also executed outside a transaction block. Thanks for the help, Laurent Birtz
Hello, I've figured out the LISTEN / NOTIFY race condition I had previously encountered. It is not trivial, so I'll try to give as much details as possible to describe what is going on. Here is a description of the two Postgres functions involved in the race condition: void CommitTransaction() { ... /* NOTIFY commit must come before lower-level cleanup */ AtCommit_Notify(); ... /* * Here is where we really truly commit. */ latestXid = RecordTransactionCommit(); ... /* * This is all post-commit cleanup. Note that if an error is raised here, * it's too late to abort the transaction. This should be just * noncritical resource releasing. * * The ordering of operations is not entirely random. The idea is: * release resources visible to other backends (eg, files, buffer pins); * then release locks; then release backend-local resources. We want to * release locks at the point where any backend waiting for us will see * our transaction as being fully cleaned up. * * Resources that can be associated with individual queries are handled by * the ResourceOwner mechanism. The other calls here are for backend-wide * state. */ ResourceOwnerRelease(TopTransactionResourceOwner, RESOURCE_RELEASE_LOCKS, true, true); ... } void Async_Listen(const char *relname) { ... lRel = heap_open(ListenerRelationId, ExclusiveLock); ... /* * OK to insert a new tuple */ simple_heap_insert(lRel, tuple); ... heap_close(lRel, ExclusiveLock); ... } In summary, CommitTransaction() notifies FIRST and commits AFTER. Async_Listen() releases its lock BEFORE the end of its transaction. I've written a program that triggers the bug quickly. Here is what this program do: 1) The program deletes every row of the table 'the_log'. This table contains an AFTER-INSERT trigger that notifies processes listening on it. 2) The program starts a command thread and an event thread concurrently. 3) The command thread executes the following statements: a) BEGIN TRANSACTION b) INSERT INTO the_log ... c) COMMIT 4) The event thread executes 'LISTEN the_log'. 5) The event thread executes 'SELECT * FROM the_log'. 6) If the event thread found the event, we clean up and go back to 1). 7) The event thread waits for a notification. If the notification is received, we clean up and go back to 1). 8) We reproduced the bug. I've instrumented the Postgres source to show what is going on in the functions of interest. I print both the executing process PID and the current time in microseconds. My machine has only one CPU. Hence, I'm pretty sure the following trace accurately reflects what is happening when my program is executing. Note: PID 29295 is the backend for the command thread. PID 29296 is the backend for the event thread. Blank lines correspond to context switches. Command thread: executing |start transaction|. Event thread: executing |LISTEN the_log|. // The command backend is handling 'start transaction'. DEBUG: [PID 29295] [Time 497310]: BeginTransactionBlock() called for BEGIN Command thread: executing |insert into the_log...|. // The event backend is doing the listen. DEBUG: [PID 29296] [Time 499744]: StartTransaction() called DEBUG: Async_Listen(the_log,29296) DEBUG: [PID 29296] [Time 501046]: CommitTransaction() called DEBUG: [PID 29296] [Time 501163]: RecordTransactionCommit() called // The command backend is handling 'insert'. DEBUG: Async_Notify(the_log) DEBUG: [PID 29295] [Time 502624]: CommitTransactionCommand: incrementing command counter Command thread: executing |commit|. // The command backend is handling 'commit'. DEBUG: [PID 29295] [Time 503192]: CommitTransaction() called DEBUG: [PID 29295] AtCommit_Notify DEBUG: [PID 29295] AtCommit_Notify: done DEBUG: [PID 29295] [Time 507981]: RecordTransactionCommit() called // The event backend is still handling 'listen'. DEBUG: [PID 29296] [Time 508319]: CommitTransaction: about to release with RESOURCE_RELEASE_LOCKS. Event thread: executing |select * from the_log|. // The event backend is handling 'select'. DEBUG: [PID 29296] [Time 508754]: StartTransaction() called DEBUG: [PID 29296] [Time 510082]: CommitTransaction() called DEBUG: [PID 29296] [Time 510207]: RecordTransactionCommit() called DEBUG: [PID 29296] [Time 510326]: CommitTransaction: about to release with RESOURCE_RELEASE_LOCKS. // The command thread is still handling 'commit'. DEBUG: [PID 29295] [Time 510690]: CommitTransaction: about to release with RESOURCE_RELEASE_LOCKS. Analysis: The command backend begins a transaction, then the event backend begins a transaction for LISTEN. The event backend updates the pg_listener table in mutual exclusion, but it releases the mutex before the transaction is commited. The command thread does its INSERT then starts its COMMIT. The command backend scans the pg_listener table and doesn't find anyone to notify. Before the command backend has the time to actually commit, the event backend commits and the listen thread has the time to execute the SELECT statement. Since the command thread didn't commit yet, the SELECT returns nothing. Finally, the command thread commits, without notifying anyone. Essentially, the event thread won't receive a notification for the events commited when it executes the LISTEN statement, and it won't necessarily see them either if it executes a SELECT statement subsequently. I believe this greatly limits the usefulness of the LISTEN / NOTIFY mechanism. Solutions: There are two solutions I developed: 1) Don't use an AFTER-INSERT trigger. Execute NOTIFY in its own top-level transaction. Cons: - Bad for performance. - Annoying kludge. - It's flawed. If the command process crashes before it commits the NOTIFY, the event won't be noticed. 2) Fix the race condition in Postgres. In Async_Listen(): change 'heap_close(lRel, ExclusiveLock);' for 'heap_close(lRel, NoLock);'. I've tested this solution. It seems to work and I believe this is the best solution. I would appreciate receiving feedback on this issue. Thanks for your time, Laurent Birtz
Laurent Birtz <laurent.birtz@kryptiva.com> writes: > The command backend begins a transaction, then the event backend begins a > transaction for LISTEN. The event backend updates the pg_listener table > in mutual exclusion, but it releases the mutex before the transaction is > commited. The command thread does its INSERT then starts its COMMIT. The > command backend scans the pg_listener table and doesn't find anyone to > notify. Before the command backend has the time to actually commit, the > event backend commits and the listen thread has the time to execute the > SELECT statement. Since the command thread didn't commit yet, the SELECT > returns nothing. Finally, the command thread commits, without notifying > anyone. Hmm ... yup, that's a race condition all right, and of sufficiently low probability that it's not surprising it's gone undetected for so long. > In Async_Listen(): change > 'heap_close(lRel, ExclusiveLock);' for 'heap_close(lRel, NoLock);'. This solution is pretty ugly, though, because we allow people to execute LISTEN/UNLISTEN in transaction blocks, which means that the ExclusiveLock could be held for quite some time. Not only is that bad for performance but it poses significant risks of deadlocks. What I am thinking is that we need to change LISTEN/UNLISTEN so that they don't attempt to modify the pg_listener table until COMMIT time. It would go about like this: 1. LISTEN and UNLISTEN would make entries in a transaction-lifespan list, much as NOTIFY does, with suitable logic to cancel each others' effects on the list as needed. They don't touch pg_listener at all. 2. If we abort the transaction then the list of pending actions is just thrown away. (Hmm, we'd also need to account for subtransactions...) 3. If we commit, then AtCommit_Notify is responsible for applying any pg_listener insertions and deletions indicated by the pending-actions list, after it grabs ExclusiveLock and before it starts the notify loop. (A CommandCounterIncrement between this step and the notify loop will assure that LISTEN and NOTIFY in the same transaction result in a self-notify just as before.) Since AtCommit_Notify doesn't release the lock, the race condition is fixed, and since it only happens immediately before commit, there is no added risk of deadlock. The only externally visible difference in behavior is that it's less likely for failed LISTENing transactions to leave dead tuples in pg_listener. There is an *internally* visible difference, in that a sequence like BEGIN; LISTEN foo; SELECT ... FROM pg_listener ... COMMIT; will fail to see any tuple from the LISTEN in pg_listener, where historically it did. I suppose it's conceivable that some application out there depends on this, but it doesn't seem very likely. (Slony guys want to speak up here?) Comments? Have I missed anything? Looking at this sketch, my first reaction is that it's a lot of code to write in support of an implementation we hope to throw away soon. But my second reaction is that we're going to need most of that code anyway in a pg_listener-less implementation, because the module will have to emulate the current transactional behavior of LISTEN/UNLISTEN without any table to store rows in. So there should be something salvageable from the effort. Assuming that we implement this (I'm willing to write the code), is it sane to back-patch such a non-trivial change? It's a bit scary but on the other hand we've back-patched larger changes when we had to. Leaving the race condition in place isn't appetizing, and neither is introducing deadlock risks that weren't there before. regards, tom lane
>> In Async_Listen(): change >> 'heap_close(lRel, ExclusiveLock);' for 'heap_close(lRel, NoLock);'. >> > > This solution is pretty ugly, though, because we allow people to > execute LISTEN/UNLISTEN in transaction blocks, which means that the > ExclusiveLock could be held for quite some time. Not only is that bad > for performance but it poses significant risks of deadlocks. > True, I had not considered this. > Comments? Have I missed anything? > As far as I can tell it seems fine. Thanks a lot for your input! Laurent Birtz
Tom Lane wrote: > This patch will apply against HEAD and 8.3, but not cleanly against > prior versions. Since this code hasn't changed materially (except for > additions of subtransaction and 2PC support) for a long time, > back-patching should be straightforward, but I haven't actually done > that yet. > > Comments? > I've reviewed the patch briefly and tested it against my version of Postgres (8.3/Debian) and the test program I wrote to reproduce the bug. I confirm the test program works fine. Thanks for the quick fix! Laurent Birtz