Thread: condition variables
Hi, Some of my EnterpriseDB colleagues and I have been working on various parallel query projects, all of which have been previously disclosed here: https://wiki.postgresql.org/wiki/EnterpriseDB_database_server_roadmap One issue we've encountered is that it's not very easy for one process in a group of cooperating parallel processes to wait for another process in that same group. One idea is to have one process grab an LWLock and other processes try to acquire it, but that actually doesn't work very well. A pretty obvious problem is that it holds off interrupts for the entire time that you are holding the lock, which is pretty undesirable. A more subtle problem is that it's easy to conceive of situations where the LWLock paradigm is just a very poor fit for what you actually want to do. For example, suppose you have a computation which proceeds in two phases: each backend that finishes phase 1 must wait until all backends finish phase 1, and once all have finished, all can begin phase 2. You could handle this case by having an LWLock which everyone holds during phase 1 in shared mode, and then everyone must briefly acquire it in exclusive mode before starting phase 2, but that's an awful hack. It also has race conditions: what if someone finishes phase 1 before everyone has started phase 1? And what if there are 10 phases instead of 2? Another approach to the problem is to use a latch wait loop. That almost works. Interrupts can be serviced, and you can recheck shared memory to see whether the condition for proceeding is satisfied after each iteration of the loop. There's only one problem: when you do something that might cause the condition to be satisfied for other waiting backends, you need to set their latch - but you don't have an easy way to know exactly which processes are waiting, so how do you call SetLatch? I originally thought of adding a function like SetAllLatches(ParallelContext *) and maybe that can work, but then I had what I think is a better idea, which is to introduce a notion of condition variables. Condition variables, of course, are a standard synchronization primitive: https://en.wikipedia.org/wiki/Monitor_(synchronization)#Condition_variables_2 Basically, a condition variable has three operations: you can wait for the condition variable; you can signal the condition variable to wake up one waiter; or you can broadcast on the condition variable to wake up all waiters. Atomically with entering the wait, you must be able to check whether the condition is satisfied. So, in my implementation, a condition variable wait loop looks like this: for (;;) { ConditionVariablePrepareToSleep(cv); if (condition for which we are waiting is satisfied) break; ConditionVariableSleep(); } ConditionVariableCancelSleep(); To wake up one waiter, another backend can call ConditionVariableSignal(cv); to wake up all waiters, ConditionVariableBroadcast(cv). I am cautiously optimistic that this design will serve a wide variety of needs for parallel query development - basically anything that needs to wait for another process to reach a certain point in the computation that can be detected through changes in shared memory state. The attached patch condition-variable-v1.patch implements this API.
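To make the two-phase example concrete, here is a minimal sketch of how it might look with the proposed API. Only the ConditionVariable calls come from the proposal; the PhaseState struct, its fields, and the header name storage/condition_variable.h are assumptions made for the sketch.

#include "postgres.h"
#include "storage/condition_variable.h"    /* assumed header name from the patch */
#include "storage/spin.h"

typedef struct PhaseState
{
    slock_t     mutex;              /* protects nworkers_done */
    int         nworkers_done;      /* workers that have finished phase 1 */
    int         nworkers_total;
    ConditionVariable phase1_done;
} PhaseState;

/* Each worker calls this when it finishes phase 1; on return, all have. */
static void
finish_phase1(PhaseState *ps)
{
    bool        last;

    SpinLockAcquire(&ps->mutex);
    last = (++ps->nworkers_done == ps->nworkers_total);
    SpinLockRelease(&ps->mutex);

    if (last)
    {
        /* Shared state was already updated above; now wake everyone. */
        ConditionVariableBroadcast(&ps->phase1_done);
        return;
    }

    /* Otherwise wait, using the loop shape shown above. */
    for (;;)
    {
        bool        done;

        ConditionVariablePrepareToSleep(&ps->phase1_done);
        SpinLockAcquire(&ps->mutex);
        done = (ps->nworkers_done == ps->nworkers_total);
        SpinLockRelease(&ps->mutex);
        if (done)
            break;
        ConditionVariableSleep();
    }
    ConditionVariableCancelSleep();
}

Note that the broadcasting side updates the shared count before broadcasting, so a waiter that rechecks the count after ConditionVariablePrepareToSleep cannot miss the final arrival.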
I originally open-coded the wait queue for this, but I've just finished rebasing it on top of Thomas Munro's proclist stuff, so before applying this patch you need the one from here: https://www.postgresql.org/message-id/CAEepm=0Vvr9zgwHt67RwuTfwMEby1GiGptBk3xFPDbbgEtZgMg@mail.gmail.com At some point while hacking on this I realized that we could actually replace the io_in_progress locks with condition variables; the attached patch buffer-io-cv-v1.patch does this (it must be applied on top of the proclist patch from the above email and also on top of condition-variable-v1.patch). Using condition variables here seems to have a couple of advantages. First, it means that a backend waiting for buffer I/O to complete is interruptible. Second, it fixes a long-running bit of nastiness in AbortBufferIO: right now, if a backend that is doing buffer I/O aborts, the abort causes it to release all of its LWLocks, including the buffer I/O lock. Everyone waiting for that buffer busy-loops until the aborting process gets around to reacquiring the lock and updating the buffer state in AbortBufferIO. But if we replace the io_in_progress locks with condition variables, then that doesn't happen any more. Nobody is "holding" the condition variable, so it doesn't get "released" when the process doing I/O aborts. Instead, they just keep sleeping until the aborting process reaches AbortBufferIO, and then it broadcasts on the condition variable and wakes everybody up, which seems a good deal nicer. I'm very curious to know whether other people like this abstraction and whether they think it will be useful for things they want to do with parallel query (or otherwise). Comments welcome. Review appreciated. Other suggestions for how to handle this are cool, too. Credit: These patches were written by me; an earlier version of the condition-variable-v1.patch was reviewed and tested by Rahila Syed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
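To give a flavour of the buffer-I/O idea above, here is a purely illustrative sketch of the waiting and completion/abort paths with a per-buffer condition variable. The FakeBufferDesc struct, its fields, and the flag bit are invented for the sketch and are not the contents of buffer-io-cv-v1.patch.

#include "postgres.h"
#include "storage/condition_variable.h"    /* assumed header name from the patch */
#include "storage/spin.h"

#define FAKE_IO_IN_PROGRESS 0x01    /* invented flag bit */

/* Simplified, hypothetical buffer header -- not the real BufferDesc. */
typedef struct FakeBufferDesc
{
    slock_t     hdr_lock;
    uint32      flags;
    ConditionVariable io_done_cv;
} FakeBufferDesc;

/* Waiting side: interruptible, and nothing is "held" while we sleep. */
static void
WaitForBufferIO(FakeBufferDesc *buf)
{
    for (;;)
    {
        bool        in_progress;

        ConditionVariablePrepareToSleep(&buf->io_done_cv);
        SpinLockAcquire(&buf->hdr_lock);
        in_progress = (buf->flags & FAKE_IO_IN_PROGRESS) != 0;
        SpinLockRelease(&buf->hdr_lock);
        if (!in_progress)
            break;
        ConditionVariableSleep();
    }
    ConditionVariableCancelSleep();
}

/*
 * Completion side, including the abort path: update shared state first,
 * then broadcast, so waiters that recheck will see the flag cleared.
 */
static void
FinishBufferIO(FakeBufferDesc *buf)
{
    SpinLockAcquire(&buf->hdr_lock);
    buf->flags &= ~FAKE_IO_IN_PROGRESS;
    SpinLockRelease(&buf->hdr_lock);
    ConditionVariableBroadcast(&buf->io_done_cv);
}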
Robert Haas <robertmhaas@gmail.com> writes: > I had what I think is a better idea, which is to introduce a notion of > condition variables. Interesting proposal. > ... Using condition variables here seems to > have a couple of advantages. First, it means that a backend waiting > for buffer I/O to complete is interruptible. Second, it fixes a > long-running bit of nastiness in AbortBufferIO: right now, if a > backend that is doing buffer I/O aborts, the abort causes it to > release all of its LWLocks, including the buffer I/O lock. Everyone > waiting for that buffer busy-loops until the aborting process gets > around to reacquiring the lock and updating the buffer state in > AbortBufferIO. But if we replace the io_in_progress locks with > condition variables, then that doesn't happen any more. Nobody is > "holding" the condition variable, so it doesn't get "released" when > the process doing I/O aborts. Instead, they just keep sleeping until > the aborting process reaches AbortBufferIO, and then it broadcasts on > the condition variable and wakes everybody up, which seems a good deal > nicer. Hmm. I fear the only reason you see an advantage there is that you don't (yet) have any general-purpose mechanism for an aborting transaction to satisfy its responsibilities vis-a-vis waiters on condition variables. Instead, this wins specifically because you stuck some bespoke logic into AbortBufferIO. OK ... but that sounds like we're going to end up with every single condition variable that ever exists in the system needing to be catered for separately and explicitly during transaction abort cleanup. Which does not sound promising from a reliability standpoint. On the other hand, I don't know what the equivalent rule to "release all LWLocks during abort" might look like for condition variables, so I don't know if it's even possible to avoid that. I encourage you to pursue this, because indeed LWLocks aren't always an ideal solution, but I think it requires some careful thought about what transaction aborts will do with condition variables. regards, tom lane
On Thu, Aug 11, 2016 at 2:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Another approach to the problem is to use a latch wait loop. That > almost works. Interrupts can be serviced, and you can recheck shared > memory to see whether the condition for proceeding is satisfied after > each iteration of the loop. There's only one problem: when you do > something that might cause the condition to be satisfied for other > waiting backends, you need to set their latch - but you don't have an > easy way to know exactly which processes are waiting, so how do you > call SetLatch? That's what I ended up doing with parallel CREATE INDEX. It worked, but it would be nice to have a general purpose facility for waiting for conditions to change. > https://en.wikipedia.org/wiki/Monitor_(synchronization)#Condition_variables_2 > > Basically, a condition variable has three operations: you can wait for > the condition variable; you can signal the condition variable to wake > up one waiter; or you can broadcast on the condition variable to wake > up all waiters. Atomically with entering the wait, you must be able > to check whether the condition is satisfied. So, in my > implementation, a condition variable wait loop looks like this: > > for (;;) > { > ConditionVariablePrepareToSleep(cv); > if (condition for which we are waiting is satisfied) > break; > ConditionVariableSleep(); > } > ConditionVariableCancelSleep(); > > To wake up one waiter, another backend can call > ConditionVariableSignal(cv); to wake up all waiters, > ConditionVariableBroadcast(cv). This seems convenient. I notice that you acquire a spinlock within the implementation of condition variables. Is it worth any effort to consolidate the number of spinlock acquisitions? In other words, maybe the most common idioms should be baked into the ConditionVariable interface, which could save callers from having to use their own mutex variable. -- Peter Geoghegan
On Fri, Aug 12, 2016 at 9:47 AM, Robert Haas <robertmhaas@gmail.com> wrote: > https://en.wikipedia.org/wiki/Monitor_(synchronization)#Condition_variables_2 > > Basically, a condition variable has three operations: you can wait for > the condition variable; you can signal the condition variable to wake > up one waiter; or you can broadcast on the condition variable to wake > up all waiters. Atomically with entering the wait, you must be able > to check whether the condition is satisfied. So, in my > implementation, a condition variable wait loop looks like this: > > for (;;) > { > ConditionVariablePrepareToSleep(cv); > if (condition for which we are waiting is satisfied) > break; > ConditionVariableSleep(); > } > ConditionVariableCancelSleep(); > > To wake up one waiter, another backend can call > ConditionVariableSignal(cv); to wake up all waiters, > ConditionVariableBroadcast(cv). It is interesting to compare this interface with Wikipedia's description, POSIX's pthread_cond_t and C++'s std::condition_variable. In those interfaces, the wait operation takes a mutex which must already be held by the caller. It unlocks the mutex and begins waiting atomically. Then when it returns, the mutex is automatically reacquired. This approach avoids race conditions as long as the shared state change you are awaiting is protected by that mutex. If you check that state before waiting and while still holding the lock, you can be sure not to miss any change signals, and then when it returns you can check the state again and be sure that no one can be concurrently changing it. In contrast, this proposal leaves it up to client code to get that right, similarly to the way you need to do things in a certain order when waiting for state changes with latches. You could say that it's more error prone: I think there have been a few cases of incorrectly coded latch/state-change wait loops in the past. On the other hand, it places no requirements on the synchronisation mechanism the client code uses for the related shared state. pthread_cond_wait requires you to pass in a pointer to the related pthread_mutex_t, whereas with this proposal client code is free to use atomic ops, lwlocks, spinlocks or any other mutual exclusion mechanism to coordinate state changes and deal with cache coherency. Then there is the question of what happens when the backend that is supposed to be doing the signalling dies or aborts, which Tom Lane referred to in his reply. In those other libraries there is no such concern: it's understood that these are low level thread synchronisation primitives and if you're waiting for something that never happens, you'll be waiting forever. I don't know what the answer is in general for Postgres condition variables, but... The thing that I personally am working on currently that is very closely related and could use this has a more specific set of circumstances: I want "join points" AKA barriers. Something like pthread_barrier_t. (I'm saying "join point" rather than "barrier" to avoid confusion with compiler and memory barriers, barrier.h etc.) Join points let you wait for all workers in a known set to reach a given point, possibly with a phase number or at least sense (one bit phase counter) to detect synchronisation bugs. 
They also select one worker arbitrarily to receive a different return value when releasing workers from a join point, for cases where a particular phase of parallel work needs to be done by exactly one worker while the others sit on the bench: for example initialisation, cleanup or merging (cf. PTHREAD_BARRIER_SERIAL_THREAD). Clearly a join point could be not much more than a condition variable and some state tracking arrivals and departures, but I think that this higher level synchronisation primitive might have an advantage over raw condition variables in the abort case: it can know the total set of workers that it's waiting for, if they are somehow registered with it first, and registration can include arranging for cleanup hooks to do the right thing. It's already a requirement for a join point to know which workers exist (or at least how many). The deal would then be that when you call joinpoint_join(&some_joinpoint, phase), it will return only when all peers have joined or detached, where the latter happens automatically if they abort or die. Not at all sure of the details yet... but I suspect join points are useful for a bunch of things like parallel sort, parallel hash join (my project), and anything else involving phases or some form of "fork/join" parallelism. Or perhaps that type of thinking about error handling should be pushed down to the condition variable. How would that look: all potential signallers would have to register to deliver a goodbye signal in their abort and shmem exit paths? Then what happens if you die before registering? I think even if you find a way to do that I'd still need to do similar extra work on top for my join points concept, because although I do need waiters to be poked at the time a worker aborts or dies, one goodbye prod isn't enough: I'd also need to adjust the join point's set of workers, or put it into error state. -- Thomas Munro http://www.enterprisedb.com
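For comparison, the POSIX idiom described above looks roughly like this; phase1_done stands in for whatever shared state is being awaited, and the mutex passed to pthread_cond_wait is released and reacquired atomically around the sleep, with the while loop absorbing spurious wakeups.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool phase1_done = false;    /* example condition */

static void
wait_for_phase1(void)
{
    pthread_mutex_lock(&mtx);
    while (!phase1_done)
        pthread_cond_wait(&cond, &mtx);    /* unlock, sleep, relock: atomic */
    pthread_mutex_unlock(&mtx);
}

static void
announce_phase1(void)
{
    pthread_mutex_lock(&mtx);
    phase1_done = true;
    pthread_mutex_unlock(&mtx);
    pthread_cond_broadcast(&cond);         /* wake all waiters */
}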
On Thu, Aug 11, 2016 at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> But if we replace the io_in_progress locks with >> condition variables, then that doesn't happen any more. Nobody is >> "holding" the condition variable, so it doesn't get "released" when >> the process doing I/O aborts. Instead, they just keep sleeping until >> the aborting process reaches AbortBufferIO, and then it broadcasts on >> the condition variable and wakes everybody up, which seems a good deal >> nicer. > > Hmm. I fear the only reason you see an advantage there is that you don't > (yet) have any general-purpose mechanism for an aborting transaction to > satisfy its responsibilities vis-a-vis waiters on condition variables. In the specific case of parallel query, this problem doesn't really arise: if an error happens, it will be propagated back to the leader (unless it occurs in the leader as an initial matter) and the leader will then kill off all of the other workers. Since waiting for a condition variable doesn't block interrupts, the workers will all stop waiting and die like the obedient lemmings they are. > Instead, this wins specifically because you stuck some bespoke logic into > AbortBufferIO. OK ... but that sounds like we're going to end up with > every single condition variable that ever exists in the system needing to > be catered for separately and explicitly during transaction abort cleanup. > Which does not sound promising from a reliability standpoint. On the > other hand, I don't know what the equivalent rule to "release all LWLocks > during abort" might look like for condition variables, so I don't know > if it's even possible to avoid that. I don't think it is possible to avoid that. Certainly the fact that LWLocks are automatically released is in most scenarios a big advantage, but for the I/O-in-progress locks it isn't. I note that I didn't so much insert bespoke logic there as replace the existing bespoke logic with less-ugly bespoke logic. One way to conceptualize a condition variable is that it is the wait queue for an LWLock without the lock itself. The "lock modes" are defined by the interaction between the condition checked by waiters before sleeping and the shared memory updates performed before signalling or broadcasting. This separation of concerns allows for enormous flexibility - you could use CVs to implement an LWLock-like construct with an arbitrary set of lock modes and an arbitrary conflict matrix, for example. But I think that in most cases it will be best to leave mutual exclusion to LWLocks, and use CVs when what we need is not mutual exclusion but rather waiting for an operation performed by some other backend to complete (successfully or otherwise). The IO-in-progress example is just such a case: the current spinning behavior arises from the fact that the buffer's I/O-in-progress flag does not get cleared until AFTER processes have been dropped off the wait queue. The normal pattern with a CV will be (1) perform a shared memory update that will, if seen by other processes, confirm that the condition has been met and then (2) broadcast on the CV. There's no way for transaction abort to know what (1) looks like, even if it somehow were made aware of the CV so that it could do (2). > I encourage you to pursue this, because indeed LWLocks aren't always > an ideal solution, but I think it requires some careful thought about > what transaction aborts will do with condition variables. Thanks.
As stated above, I believe the right answer is in fact "nothing", because this facility is too low-level to permit a categorical answer. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
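As a sketch of the "LWLock-like construct" remark above, here is a trivial exclusive-only lock built from a flag, a spinlock, and a condition variable; the CVLock type and function names are invented, and nobody is proposing to actually replace LWLocks this way.

#include "postgres.h"
#include "storage/condition_variable.h"    /* assumed header name from the patch */
#include "storage/spin.h"

typedef struct CVLock
{
    slock_t     mutex;          /* protects "held" */
    bool        held;
    ConditionVariable released;
} CVLock;

static void
CVLockAcquire(CVLock *lock)
{
    for (;;)
    {
        bool        got_it;

        ConditionVariablePrepareToSleep(&lock->released);
        SpinLockAcquire(&lock->mutex);
        got_it = !lock->held;
        if (got_it)
            lock->held = true;
        SpinLockRelease(&lock->mutex);
        if (got_it)
            break;
        ConditionVariableSleep();   /* nothing is held while waiting */
    }
    ConditionVariableCancelSleep();
}

static void
CVLockRelease(CVLock *lock)
{
    SpinLockAcquire(&lock->mutex);
    lock->held = false;
    SpinLockRelease(&lock->mutex);
    ConditionVariableSignal(&lock->released);   /* let one waiter retry */
}

A woken waiter may still lose the race to a newcomer, in which case it simply goes back to sleep; that is exactly the "check the condition before sleeping" discipline the CV interface asks of its callers.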
On Thu, Aug 11, 2016 at 6:37 PM, Peter Geoghegan <pg@heroku.com> wrote: > I notice that you acquire a spinlock within the implementation of > condition variables. Is it worth any effort to consolidate the number > of spinlock acquisitions? In other words, maybe the most common idioms > should be baked into the ConditionVariable interface, which could save > callers from having to use their own mutex variable. One thing to keep in mind is that spinlocks are extremely fast as long as you don't have too many processes contending for them. With parallel groups (or I/O-in-progress wait queues) containing a single-digit number of processes, I doubt that consolidating the spinlock acquisitions will produce any measurable benefit. If we get to the point of having parallel groups containing scores of processes, that could change. Also, right now we don't have enough users of the CV interface to know what the most common idioms will be. We could speculate, but if we do, I'll start off the speculation by guessing that there will be a lot of diversity, and not too much that keeps getting repeated. If that proves to be wrong, of course, we can always go back and change it later. We're not writing this on stone tablets. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Aug 11, 2016 at 8:44 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > In contrast, this proposal leaves it up to client code to get that > right, similarly to the way you need to do things in a certain order > when waiting for state changes with latches. You could say that it's > more error prone: I think there have been a few cases of incorrectly > coded latch/state-change wait loops in the past. On the other hand, > it places no requirements on the synchronisation mechanism the client > code uses for the related shared state. pthread_cond_wait requires > you to pass in a pointer to the related pthread_mutex_t, whereas with > this proposal client code is free to use atomic ops, lwlocks, > spinlocks or any other mutual exclusion mechanism to coordinate state > changes and deal with cache coherency. I think you have accurately stated the pros and cons of this approach. On the whole I think it's a good trade-off. In particular, not being locked into a specific synchronization method seems like a good idea from here. If you had to pick just one, it would be hard to decide between spinlocks and LWLocks, and "atomics" isn't a sufficiently specific thing to code to. > Then there is the question of what happens when the backend that is > supposed to be doing the signalling dies or aborts, which Tom Lane > referred to in his reply. In those other libraries there is no such > concern: it's understood that these are low level thread > synchronisation primitives and if you're waiting for something that > never happens, you'll be waiting forever. I don't know what the > answer is in general for Postgres condition variables, but... As I noted in my reply to Tom, for parallel query, we're going to kill all the workers at the same time, so the problem doesn't arise. When we use this mechanism outside that context, we just have to do it correctly. I don't think that's especially hard, but could somebody mess it up? Sure. > The thing that I personally am working on currently that is very > closely related and could use this has a more specific set of > circumstances: I want "join points" AKA barriers. Something like > pthread_barrier_t. (I'm saying "join point" rather than "barrier" to > avoid confusion with compiler and memory barriers, barrier.h etc.) > Join points let you wait for all workers in a known set to reach a > given point, possibly with a phase number or at least sense (one bit > phase counter) to detect synchronisation bugs. They also select one > worker arbitrarily to receive a different return value when releasing > workers from a join point, for cases where a particular phase of > parallel work needs to be done by exactly one worker while the others > sit on the bench: for example initialisation, cleanup or merging (CF > PTHREAD_BARRIER_SERIAL_THREAD). Clearly a join point could be not > much more than a condition variable and some state tracking arrivals > and departures, but I think that this higher level synchronisation > primitive might have an advantage over raw condition variables in the > abort case: it can know the total set of workers that its waiting for, > if they are somehow registered with it first, and registration can > include arranging for cleanup hooks to do the right thing. It's > already a requirement for a join point to know which workers exist (or > at least how many). 
Then the deal would then be that when you call > joinpoint_join(&some_joinpoint, phase), it will return only when all > peers have joined or detached, where the latter happens automatically > if they abort or die. Not at all sure of the details yet... but I > suspect join points are useful for a bunch of things like parallel > sort, parallel hash join (my project), and anything else involving > phases or some form of "fork/join" parallelism. If I'm right that the abort/die case doesn't really need any special handling here, then I think this gets a lot simpler. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Aug 12, 2016 at 9:47 AM, Robert Haas <robertmhaas@gmail.com> wrote: > [condition-variable-v1.patch] Don't you need to set proc->cvSleeping = false in ConditionVariableSignal? -- Thomas Munro http://www.enterprisedb.com
On Sun, Aug 14, 2016 at 9:04 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Fri, Aug 12, 2016 at 9:47 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> [condition-variable-v1.patch] > > Don't you need to set proc->cvSleeping = false in ConditionVariableSignal? I poked at this a bit... OK, a lot... and have some feedback: 1. As above, we need to clear cvSleeping before setting the latch. 2. The following schedule corrupts the waitlist by trying to delete something from it that isn't in it: P1: ConditionVariablePrepareToSleep: push self onto waitlist P2: ConditionVariableSignal: pop P1 from waitlist P1: <check user condition, decide to break without ever sleeping> P3: ConditionVariablePrepareToSleep: push self onto waitlist P1: ConditionVariableCancelSleep: delete self from waitlist (!) Without P3 coming along you probably wouldn't notice because the waitlist will be happily cleared and look valid, but P3's entry gets lost and then it hangs forever. One solution is to teach ConditionVariableCancelSleep to check if we're still actually there first. That can be done in O(1) time by clearing nodes' next and prev pointers when deleting, so that we can rely on that in a new proclist_contains() function. See attached. 3. The following schedule corrupts the waitlist by trying to insert something into it that is already in it: P1: ConditionVariablePrepareToSleep: push self onto waitlist P1: <check user condition, decide to sleep> P1: ConditionVariableSleep P1: ConditionVariablePrepareToSleep: push self onto waitlist (!) Nodes before and after self's pre-existing position can be forgotten when self's node is pushed to the front of the list. That can be fixed by making ConditionVariablePrepareToSleep also check if we're already in the list. 4. The waitlist is handled LIFO-ly. Would it be better for the first guy to start waiting to be woken up first, like we do for locks? The Pthreads equivalent says that it depends on "scheduling policy". I don't know if it matters much, just an observation. 5. The new proclist function you added is the first to work in terms of PGPROC* rather than procno. Maybe the whole interface should work with either PGPROC pointers or procnos? No strong view. Please find attached a -v2 of your patch which includes suggestions 1-3 above. Like the -v1, it applies on top of lwlocks-in-dsm-v3.patch. Also, I have attached a v2->v3 diff to show just my proposed changes. -- Thomas Munro http://www.enterprisedb.com
Attachment
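The O(1) membership test described in point 2 can be illustrated with a simplified, self-contained list; this is not the proclist API itself (whose nodes live inside PGPROC and are linked by pgprocno), just the underlying idea: deletion clears a node's links, so a later membership check only needs to look at those links, plus the list head to cover the single-element case.

#include <stdbool.h>

#define NOT_LINKED (-1)     /* stored in the links of unlinked nodes */

typedef struct wait_node
{
    int         next;
    int         prev;
} wait_node;

typedef struct wait_list    /* initialize with head = tail = NOT_LINKED */
{
    int         head;
    int         tail;
} wait_list;

static void
wait_list_push_tail(wait_list *list, wait_node *nodes, int i)
{
    nodes[i].next = NOT_LINKED;
    nodes[i].prev = list->tail;
    if (list->tail != NOT_LINKED)
        nodes[list->tail].next = i;
    else
        list->head = i;
    list->tail = i;
}

static void
wait_list_delete(wait_list *list, wait_node *nodes, int i)
{
    if (nodes[i].prev == NOT_LINKED)
        list->head = nodes[i].next;
    else
        nodes[nodes[i].prev].next = nodes[i].next;
    if (nodes[i].next == NOT_LINKED)
        list->tail = nodes[i].prev;
    else
        nodes[nodes[i].next].prev = nodes[i].prev;

    /* The key step: clear the links so membership is testable in O(1). */
    nodes[i].next = nodes[i].prev = NOT_LINKED;
}

/* In the list iff linked to a neighbour, or the sole element (the head). */
static bool
wait_list_contains(const wait_list *list, const wait_node *nodes, int i)
{
    return nodes[i].next != NOT_LINKED ||
        nodes[i].prev != NOT_LINKED ||
        list->head == i;
}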
On Mon, Aug 15, 2016 at 5:58 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Also, I have attached a v2->v3 diff ... Ugh. I meant a v1->v2 diff. -- Thomas Munro http://www.enterprisedb.com
On Mon, Aug 15, 2016 at 1:58 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Sun, Aug 14, 2016 at 9:04 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Fri, Aug 12, 2016 at 9:47 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> [condition-variable-v1.patch] >> >> Don't you need to set proc->cvSleeping = false in ConditionVariableSignal? > > I poked at this a bit... OK, a lot... and have some feedback: > > 1. As above, we need to clear cvSleeping before setting the latch. Right, OK. > 2. The following schedule corrupts the waitlist by trying to delete > something from it that isn't in it: > > P1: ConditionVariablePrepareToSleep: push self onto waitlist > P2: ConditionVariableSignal: pop P1 from waitlist > P1: <check user condition, decide to break without ever sleeping> > P3: ConditionVariablePrepareToSleep: push self onto waitlist > P1: ConditionVariableCancelSleep: delete self from waitlist (!) > > Without P3 coming along you probably wouldn't notice because the > waitlist will be happily cleared and look valid, but P3's entry gets > lost and then it hangs forever. > > One solution is to teach ConditionVariableCancelSleep to check if > we're still actually there first. That can be done in O(1) time by > clearing nodes' next and prev pointers when deleting, so that we can > rely on that in a new proclist_contains() function. See attached. How about instead using cvSleeping to test this? Suppose we make a rule that cvSleeping can be changed from false to true only by the process whose PGPROC it is, and thus no lock is needed, but changing it from true to false always requires the spinlock. > 3. The following schedule corrupts the waitlist by trying to insert > something into it that is already in it: > > P1: ConditionVariablePrepareToSleep: push self onto waitlist > P1: <check user condition, decide to sleep> > P1: ConditionVariableSleep > P1: ConditionVariablePrepareToSleep: push self onto waitlist (!) > > Nodes before and after self's pre-existing position can be forgotten > when self's node is pushed to the front of the list. That can be > fixed by making ConditionVariablePrepareToSleep also check if we're > already in the list. OK. > 4. The waitlist is handled LIFO-ly. Would it be better for the first > guy to start waiting to be woken up first, like we do for locks? The > Pthreads equivalent says that it depends on "scheduling policy". I > don't know if it matters much, just an observation. I don't know whether this matters. It's possible that FIFO is a better policy; I don't really care. > 5. The new proclist function you added is the first to work in terms > of PGPROC* rather than procno. Maybe the whole interface should work > with either PGPROC pointers or procnos? No strong view. Hmm, maybe so. But wouldn't any caller translate to a PGPROC * straight off? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
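A sketch of what ConditionVariableSignal might look like under that rule, with cvSleeping cleared before the spinlock is dropped; the mutex and wakeup field names and the cvWaitLink/cvSleeping PGPROC members are taken from the discussion in this thread, not from the patch text.

#include "postgres.h"
#include "storage/condition_variable.h"    /* assumed header name from the patch */
#include "storage/latch.h"
#include "storage/proc.h"
#include "storage/proclist.h"
#include "storage/spin.h"

void
ConditionVariableSignal(ConditionVariable *cv)
{
    PGPROC     *proc = NULL;

    SpinLockAcquire(&cv->mutex);
    if (!proclist_is_empty(&cv->wakeup))
    {
        proc = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
        proc->cvSleeping = false;   /* cleared before the spinlock is dropped */
    }
    SpinLockRelease(&cv->mutex);

    if (proc != NULL)
        SetLatch(&proc->procLatch);
}

With this ordering, any backend whose cvSleeping is true while the spinlock is free must still be on the wakeup list, which is the invariant being proposed.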
On 2016-08-11 21:27:45 -0400, Robert Haas wrote: > On Thu, Aug 11, 2016 at 6:37 PM, Peter Geoghegan <pg@heroku.com> wrote: > > I notice that you acquire a spinlock within the implementation of > > condition variables. Is it worth any effort to consolidate the number > > of spinlock acquisitions? In other words, maybe the most common idioms > > should be baked into the ConditionVariable interface, which could save > > callers from having to use their own mutex variable. > > One thing to keep in mind is that spinlocks are extremely fast as long > as you don't have too many processes contending for them. That's one of the conditions. The other is that the system as a whole is not overcommitted, because then the chance of a process being put to sleep while holding a spinlock increases. > With > parallel groups (or I/O-in-progress wait queues) of single digit > number of processes, I doubt that consolidating the spinlock > acquisitions will produce any measurable benefit. If we get to the > point of having parallel groups containing scores of processes, that > could change. And since we have no measures to manage systemwide load with parallelism yet, I think the issue is a bit more general than the quoted paragraph. But I also think we shouldn't yet worry about it. It seems likely that the actual critical bottleneck is elsewhere for now. Andres Freund
On Mon, Aug 15, 2016 at 10:35 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Don't you need to set proc->cvSleeping = false in ConditionVariableSignal? >> >> I poked at this a bit... OK, a lot... and have some feedback: >> >> 1. As above, we need to clear cvSleeping before setting the latch. > > Right, OK. > I have independently faced this problem while using your patch and for now I have updated my local copy. If possible, please send an updated patch as this patch could be used for development of various parallelism projects. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 5, 2016 at 3:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Aug 15, 2016 at 10:35 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> Don't you need to set proc->cvSleeping = false in ConditionVariableSignal? >>> >>> I poked at this a bit... OK, a lot... and have some feedback: >>> >>> 1. As above, we need to clear cvSleeping before setting the latch. >> >> Right, OK. >> > > I have independently faced this problem while using your patch and for > now I have updated my local copy. If possible, please send an updated > patch as this patch could be used for development of various > parallelism projects. Thomas already posted an updated patch in the same message where he reported the problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 6, 2016 at 5:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Sep 5, 2016 at 3:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Mon, Aug 15, 2016 at 10:35 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>>> Don't you need to set proc->cvSleeping = false in ConditionVariableSignal? >>>> >>>> I poked at this a bit... OK, a lot... and have some feedback: >>>> >>>> 1. As above, we need to clear cvSleeping before setting the latch. >>> >>> Right, OK. >>> >> >> I have independently faced this problem while using your patch and for >> now I have updated my local copy. If possible, please send an updated >> patch as this patch could be used for development of various >> parallelism projects. > > Thomas already posted an updated patch in the same message where he > reported the problem. > Oops, I missed that, will use the same. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 11, 2016 at 2:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Another approach to the problem is to use a latch wait loop. That > almost works. Interrupts can be serviced, and you can recheck shared > memory to see whether the condition for proceeding is satisfied after > each iteration of the loop. There's only one problem: when you do > something that might cause the condition to be satisfied for other > waiting backends, you need to set their latch - but you don't have an > easy way to know exactly which processes are waiting, so how do you > call SetLatch? I originally thought of adding a function like > SetAllLatches(ParallelContext *) and maybe that can work, but then I > had what I think is a better idea, which is to introduce a notion of > condition variables. I don't see a CF entry for this. Are you planning to work on this again soon, Robert? I have an eye on this patch due to my work on parallel CREATE INDEX. It would be nice to have some rough idea of when you intend to commit this. -- Peter Geoghegan
On Tue, Sep 13, 2016 at 10:55 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Thu, Aug 11, 2016 at 2:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Another approach to the problem is to use a latch wait loop. That >> almost works. Interrupts can be serviced, and you can recheck shared >> memory to see whether the condition for proceeding is satisfied after >> each iteration of the loop. There's only one problem: when you do >> something that might cause the condition to be satisfied for other >> waiting backends, you need to set their latch - but you don't have an >> easy way to know exactly which processes are waiting, so how do you >> call SetLatch? I originally thought of adding a function like >> SetAllLatches(ParallelContext *) and maybe that can work, but then I >> had what I think is a better idea, which is to introduce a notion of >> condition variables. > > I don't see a CF entry for this. Are you planning to work on this > again soon, Robert? > > I have an eye on this patch due to my work on parallel CREATE INDEX. > It would be nice to have some rough idea of when you intend to commit > this. I basically figured I would commit it when and if it became clear that it'd get good use in some other patch which was on the road to being committed. I don't think it needs much work, just the assurance that it will get some use. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
<p dir="ltr">Maybe I can leave it up to you to determine if that applies in the context of my parallel create index patch.You are the obvious candidate to review that patch anyway, of course.<p dir="ltr">--<br /> Peter Geoghegan
On Mon, Aug 15, 2016 at 5:58 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Please find attached a -v2 of your patch which includes suggestions > 1-3 above. Here's a rebased patch. ConditionVariableSleep now takes wait_event_info. Anyone using this in patches for core probably needs to add enumerators to the WaitEventXXX enums in pgstat.h to describe their wait points. -- Thomas Munro http://www.enterprisedb.com
Attachment
On Tue, Oct 4, 2016 at 12:12 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Here's a rebased patch. ConditionVariableSleep now takes > wait_event_info. Anyone using this in patches for core probably needs > to add enumerators to the WaitEventXXX enums in pgstat.h to describe > their wait points. I think that there are at least 2 patches from EDB people that are already dependent on this one. I myself could easily adopt parallel CREATE INDEX to use it, too. Why the continued delay in committing it? I think it's fairly clear that this mechanism is widely useful. -- Peter Geoghegan
On Tue, Oct 4, 2016 at 3:12 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Mon, Aug 15, 2016 at 5:58 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> Please find attached a -v2 of your patch which includes suggestions >> 1-3 above. > > Here's a rebased patch. ConditionVariableSleep now takes > wait_event_info. Anyone using this in patches for core probably needs > to add enumerators to the WaitEventXXX enums in pgstat.h to describe > their wait points. So, in my original patch, it was intended that cvSleeping was the definitive source of truth as to whether a particular backend is committed to sleeping or not. That had some bugs, I guess, but in your version, there are now two sources of truth. On the one hand, there's still cvSleeping. On the other hand, there's now also whether proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink) returns true. I'm not sure whether that causes any serious problem. It seems possible, for example, that ConditionVariableSignal() could clear the cvSleeping flag for a backend that's now waiting for some OTHER condition variable, because once it pops the victim off the list and releases the spinlock, the other backend could meanwhile ConditionVariableCancelSleep() and then do anything it likes, including sleep on some other condition variable. Now I think that the worst thing that will happen is that the other backend will receive a spurious wakeup, but I'm not quite sure. The whole point of having cvSleeping in the first place is so that we don't get spurious wakeups just because somebody happens to set your process latch, so it seems a bit unfortunate that it doesn't actually prevent that from happening. I wonder if we shouldn't try to create the invariant that when the CV mutex is not held, the state of cvSleeping has to be true if we're in the proclist and false if we're not. So ConditionVariableSignal() would clear the flag before releasing the spinlock, for example. Then I think we wouldn't need proclist_contains(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 28, 2016 at 9:38 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Oct 4, 2016 at 3:12 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Mon, Aug 15, 2016 at 5:58 PM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >>> Please find attached a -v2 of your patch which includes suggestions >>> 1-3 above. >> >> Here's a rebased patch. ConditionVariableSleep now takes >> wait_event_info. Anyone using this in patches for core probably needs >> to add enumerators to the WaitEventXXX enums in pgstat.h to describe >> their wait points. > > So, in my original patch, it was intended that cvSleeping was the > definitive source of truth as to whether a particular backend is > committed to sleeping or not. That had some bugs, I guess, but in > your version, there are now two sources of truth. On the one hand, > there's still cvSleeping. On the other hand, there's now also whether > proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink) returns > true. > > I'm not sure whether that causes any serious problem. It seems > possible, for example, that ConditionVariableSignal() could clear the > cvSleeping flag for a backend that's now waiting for some OTHER > condition variable, because once it pops the victim off the list and > releases the spinlock, the other backend could meanwhile > ConditionVariableCancelSleep() and then do anything it likes, > including sleep on some other condition variable. Now I think that > the worst thing that will happen is that the other backend will > receive a spurious wakeup, but I'm not quite sure. The whole point of > having cvSleeping in the first place is so that we don't get spurious > wakeups just because somebody happens to set your process latch, so it > seems a bit unfortunate that it doesn't actually prevent that from > happening. > > I wonder if we shouldn't try to create the invariant that when the CV > mutex is not held, the state of cvSleeping has to be true if we're in > the proclist and false if we're not. So ConditionVariableSignal() > would clear the flag before releasing the spinlock, for example. Then > I think we wouldn't need proclist_contains(). Yeah, that'd work. ConditionVariableSleep would need to hold the spinlock while checking cvSleeping (otherwise there is a race case where another process sets it to false but this process doesn't see that yet and then waits for a latch-set which never arrives). It's not the end of the world but it seems unfortunate to have to acquire and release the spinlock in ConditionVariablePrepareToSleep and then immediately again in ConditionVariableSleep. I was going to code that up but I read this and had another idea: http://stackoverflow.com/questions/8594591/why-does-pthread-cond-wait-have-spurious-wakeups I realise that you didn't actually say you wanted to guarantee no spurious wakeups. But since the client code already has to check its own exit condition, why bother adding another layer of looping? Spurious latch sets must be super rare, but even if they aren't, what do you save by filtering them here? In this situation you already got woken from a deep slumber and scheduled back onto the CPU, so it hardly matters whether you go around again in that loop or the client's loop. We could make things really simple instead: get rid of cvSleeping, have ConditionVariablePrepareToSleep reset the latch, then have ConditionVariableSleep wait for the latch to be set just once (no loop).
Then we'd document that spurious wakeups are possible so the caller must write a robust predicate loop, exactly as you already showed in your first message. We'd need to keep that proclist_contains stuff to avoid corrupting the list. proclist_contains would be the one source of truth for whether you're in the waitlist, and the client code's predicate loop would contain the one source of truth for whether the condition it is waiting for has been reached. Thoughts? -- Thomas Munro http://www.enterprisedb.com
On Mon, Nov 21, 2016 at 7:10 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> I wonder if we shouldn't try to create the invariant that when the CV >> mutex is not help, the state of cvSleeping has to be true if we're in >> the proclist and false if we're not. So ConditionVariableSignal() >> would clear the flag before releasing the spinlock, for example. Then >> I think we wouldn't need proclist_contains(). > > Yeah, that'd work. ConditionVariableSleep would need to hold the > spinlock while checking cvSleeping (otherwise there is race case where > another process sets it to false but this process doesn't see that yet > and then waits for a latch-set which never arrives). It's not the end > of the world but it seems unfortunate to have to acquire and release > the spinlock in ConditionVariablePrepareToSleep and then immediately > again in ConditionVariableSleep. > > I was going to code that up but I read this and had another idea: > > http://stackoverflow.com/questions/8594591/why-does-pthread-cond-wait-have-spurious-wakeups > > I realise that you didn't actually say you wanted to guarantee no > spurious wakeups. But since the client code already has to check its > own exit condition, why bother adding a another layer of looping? > Sprurious latch sets must be super rare, but even if they aren't, what > do you save by filtering them here? In this situation you already got > woken from a deep slumbler and scheduled back onto the CPU, so it > hardly matters whether you go around again in that loop or the > client's loop. We could make things really simple instead: get rid of > cvSleeping, have ConditionVariablePrepareToSleep reset the latch, then > have ConditionVariableSleep wait for the latch to be set just once (no > loop). Then we'd document that spurious wakeups are possible so the > caller must write a robust predicate loop, exactly as you already > showed in your first message. We'd need to keep that > proclist_contains stuff to avoid corrupting the list. > proclist_contains would be the one source of truth for whether you're > in the waitlist, and the client code's predicate loop would contain > the one source of truth for whether the condition it is waiting for > has been reached. I don't think we can rely on spurious latch set events being rare. There are an increasing number of things that set the process latch, and there will very likely be more in the future. For instance, the arrival of tuples from a parallel worker associated with our session will set the process latch. Workers starting or dying will set the process latch. So my inclination was to try to guarantee no spurious wakeups at all, but possibly a softer guarantee that makes them unlikely would be sufficient. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Aug 11, 2016 at 5:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > So, in my > implementation, a condition variable wait loop looks like this: > > for (;;) > { > ConditionVariablePrepareToSleep(cv); > if (condition for which we are waiting is satisfied) > break; > ConditionVariableSleep(); > } > ConditionVariableCancelSleep(); I have what I think is a better idea. Let's get rid of ConditionVariablePrepareToSleep(cv) and instead tell users of this facility to write the loop this way: for (;;) { if (condition for which we are waiting is satisfied) break; ConditionVariableSleep(cv); } ConditionVariableCancelSleep(); ConditionVariableSleep(cv) will check whether the current process is already on the condition variable's waitlist. If so, it will sleep; if not, it will add the process and return without sleeping. It may seem odd that ConditionVariableSleep(cv) doesn't necessarily sleep, but this design has a significant advantage: we avoid manipulating the wait-list altogether in the case where the condition is already satisfied when we enter the loop. That's more like what we already do in lwlock.c: we try to grab the lock first; if we can't, we add ourselves to the wait-list and recheck; if we then get the lock after all, we have to remove ourselves from the wait-list. Of course, there is some cost: if we do have to wait, we'll end up checking the condition twice before actually going to sleep. However, it's probably smart to bet that actually needing to sleep is fairly infrequent, just as in lwlock.c. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
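A sketch of the control flow being proposed here, not actual patch code; the wait_event_info argument from the rebased patch is carried along, and the latch handling is simplified to show the intent.

#include "postgres.h"
#include "miscadmin.h"
#include "storage/condition_variable.h"    /* assumed header name from the patch */
#include "storage/latch.h"
#include "storage/proc.h"
#include "storage/proclist.h"
#include "storage/spin.h"

void
ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
{
    bool        in_list;

    /* Are we already registered on this CV's wait list? */
    SpinLockAcquire(&cv->mutex);
    in_list = proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
    SpinLockRelease(&cv->mutex);

    if (!in_list)
    {
        /*
         * First call in the caller's loop: just register and return, so the
         * wait list is never touched if the condition was already true.
         */
        ConditionVariablePrepareToSleep(cv);
        return;
    }

    /*
     * Still registered and the condition was still false: really sleep.  A
     * wakeup here may be spurious; the caller's loop rechecks its condition.
     */
    WaitLatch(MyLatch, WL_LATCH_SET, -1, wait_event_info);
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
}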
Hello, At Mon, 21 Nov 2016 15:57:47 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmobFjwcFEiq8j+fvH5CdXHdVJffmemNLq8MqFesg2+4Gwg@mail.gmail.com> > On Thu, Aug 11, 2016 at 5:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > So, in my > > implementation, a condition variable wait loop looks like this: > > > > for (;;) > > { > > ConditionVariablePrepareToSleep(cv); > > if (condition for which we are waiting is satisfied) > > break; > > ConditionVariableSleep(); > > } > > ConditionVariableCancelSleep(); > > I have what I think is a better idea. Let's get rid of > ConditionVariablePrepareToSleep(cv) and instead tell users of this > facility to write the loop this way: > > for (;;) > { > if (condition for which we are waiting is satisfied) > break; > ConditionVariableSleep(cv); > } > ConditionVariableCancelSleep(); It seems rather a common way to wait on a condition variable, in shorter, | while (condition for which we are waiting is *not* satisfied) | ConditionVariableSleep(cv); | ConditionVariableCancelSleep(); > ConditionVariableSleep(cv) will check whether the current process is > already on the condition variable's waitlist. If so, it will sleep; > if not, it will add the process and return without sleeping. > > It may seem odd that ConditionVariableSleep(cv) doesn't necessary > sleep, but this design has a significant advantage: we avoid > manipulating the wait-list altogether in the case where the condition > is already satisfied when we enter the loop. That's more like what we The condition check is done far faster than maintaining the wait-list for most cases, I believe. > already do in lwlock.c: we try to grab the lock first; if we can't, we > add ourselves to the wait-list and retry; if we then get the lock > after all we have to recheck whether we can get the lock and remove > ourselves from the wait-list if so. Of course, there is some cost: if > we do have to wait, we'll end up checking the condition twice before > actually going to sleep. However, it's probably smart to bet that > actually needing to sleep is fairly infrequent, just as in lwlock.c. > > Thoughts? FWIW, I agree to the assumption. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Nov 22, 2016 at 3:05 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > At Mon, 21 Nov 2016 15:57:47 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmobFjwcFEiq8j+fvH5CdXHdVJffmemNLq8MqFesg2+4Gwg@mail.gmail.com> >> On Thu, Aug 11, 2016 at 5:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> > So, in my >> > implementation, a condition variable wait loop looks like this: >> > >> > for (;;) >> > { >> > ConditionVariablePrepareToSleep(cv); >> > if (condition for which we are waiting is satisfied) >> > break; >> > ConditionVariableSleep(); >> > } >> > ConditionVariableCancelSleep(); >> >> I have what I think is a better idea. Let's get rid of >> ConditionVariablePrepareToSleep(cv) and instead tell users of this >> facility to write the loop this way: >> >> for (;;) >> { >> if (condition for which we are waiting is satisfied) >> break; >> ConditionVariableSleep(cv); >> } >> ConditionVariableCancelSleep(); > > It seems rather a common way to wait on a condition variable, in > shorter, > > | while (condition for which we are waiting is *not* satisfied) > | ConditionVariableSleep(cv); > | ConditionVariableCancelSleep(); > > Ok, let's show it that way. > > >> ConditionVariableSleep(cv) will check whether the current process is > >> already on the condition variable's waitlist. If so, it will sleep; > >> if not, it will add the process and return without sleeping. > >> > >> It may seem odd that ConditionVariableSleep(cv) doesn't necessary > >> sleep, but this design has a significant advantage: we avoid > >> manipulating the wait-list altogether in the case where the condition > >> is already satisfied when we enter the loop. That's more like what we > > > > The condition check is done far faster than maintaining the > > wait-list for most cases, I believe. > > > >> already do in lwlock.c: we try to grab the lock first; if we can't, we > >> add ourselves to the wait-list and retry; if we then get the lock > >> after all we have to recheck whether we can get the lock and remove > >> ourselves from the wait-list if so. Of course, there is some cost: if > >> we do have to wait, we'll end up checking the condition twice before > >> actually going to sleep. However, it's probably smart to bet that > >> actually needing to sleep is fairly infrequent, just as in lwlock.c. > >> > >> Thoughts? > > > > FWIW, I agree to the assumption. Here's a version that works that way, though it allows you to call ConditionVariablePrepareToSleep *optionally* before you enter your loop, in case you expect to have to wait and would rather avoid the extra loop. Maybe there isn't much point in exposing that though, since your condition test should be fast and waiting is the slow path, but we don't really know what your condition test is. I thought about that because my use case (barrier.c) does in fact expect to hit the wait case more often than not. If that seems pointless then perhaps ConditionVariablePrepareToSleep should become static and implicit. This version does attempt to suppress spurious returns, a bit, using proclist_contains. No more cvSleeping. It's possible that future users will want a version with a timeout, or multiplexed with IO, in which case there would be some interesting questions about how this should interact with WaitEventSet. It also seems like someone might eventually want to handle postmaster death. Perhaps there should eventually be a way to tell WaitEventSet that you're waiting for a CV so these things can be multiplexed without exposing the fact that it's done internally with latches.
-- Thomas Munro http://www.enterprisedb.com
Attachment
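With this version a caller ends up looking roughly like the sketch below; the shared struct, the predicate, and the WAIT_EVENT_MY_FEATURE enumerator are invented placeholders.

/* Waiting side: the prepare call is optional, and because wakeups may be
 * spurious the condition is always re-tested before giving up the wait. */
ConditionVariablePrepareToSleep(&shared->cv);   /* optional; skips one extra recheck */
while (!all_workers_attached(shared))           /* invented predicate over shared memory */
    ConditionVariableSleep(&shared->cv, WAIT_EVENT_MY_FEATURE);
ConditionVariableCancelSleep();

/* Waking side: always change the shared state first, then wake waiters. */
SpinLockAcquire(&shared->mutex);
shared->nattached++;                            /* invented shared state */
SpinLockRelease(&shared->mutex);
ConditionVariableBroadcast(&shared->cv);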
At Tue, 22 Nov 2016 17:48:07 +1300, Thomas Munro <thomas.munro@enterprisedb.com> wrote in <CAEepm=2VNfOq3spjSRGgM8WB+-PhfPXFB_adjizUUef9=cVDWQ@mail.gmail.com> > On Tue, Nov 22, 2016 at 3:05 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello, > > > > At Mon, 21 Nov 2016 15:57:47 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmobFjwcFEiq8j+fvH5CdXHdVJffmemNLq8MqFesg2+4Gwg@mail.gmail.com> > >> On Thu, Aug 11, 2016 at 5:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > >> > So, in my > >> > implementation, a condition variable wait loop looks like this: > >> > > >> > for (;;) > >> > { > >> > ConditionVariablePrepareToSleep(cv); > >> > if (condition for which we are waiting is satisfied) > >> > break; > >> > ConditionVariableSleep(); > >> > } > >> > ConditionVariableCancelSleep(); > >> > >> I have what I think is a better idea. Let's get rid of > >> ConditionVariablePrepareToSleep(cv) and instead tell users of this > >> facility to write the loop this way: > >> > >> for (;;) > >> { > >> if (condition for which we are waiting is satisfied) > >> break; > >> ConditionVariableSleep(cv); > >> } > >> ConditionVariableCancelSleep(); > > > > It seems rather a common way to wait on a condition variable, in > > shorter, > > > > | while (condition for which we are waiting is *not* satisfied) > > | ConditionVariableSleep(cv); > > | ConditionVariableCancelSleep(); > > Ok, let's show it that way. > > >> ConditionVariableSleep(cv) will check whether the current process is > >> already on the condition variable's waitlist. If so, it will sleep; > >> if not, it will add the process and return without sleeping. > >> > >> It may seem odd that ConditionVariableSleep(cv) doesn't necessary > >> sleep, but this design has a significant advantage: we avoid > >> manipulating the wait-list altogether in the case where the condition > >> is already satisfied when we enter the loop. That's more like what we > > > > The condition check is done far faster than maintaining the > > wait-list for most cases, I believe. > > > >> already do in lwlock.c: we try to grab the lock first; if we can't, we > >> add ourselves to the wait-list and retry; if we then get the lock > >> after all we have to recheck whether we can get the lock and remove > >> ourselves from the wait-list if so. Of course, there is some cost: if > >> we do have to wait, we'll end up checking the condition twice before > >> actually going to sleep. However, it's probably smart to bet that > >> actually needing to sleep is fairly infrequent, just as in lwlock.c. > >> > >> Thoughts? > > > > FWIW, I agree to the assumption. > > Here's a version that works that way, though it allows you to call > ConditionVariablePrepareToSleep *optionally* before you enter your > loop, in case you expect to have to wait and would rather avoid the > extra loop. Maybe there isn't much point in exposing that though, > since your condition test should be fast and waiting is the slow path, > but we don't really really know what your condition test is. I > thought about that because my use case (barrier.c) does in fact expect > to hit the wait case more often than not. If that seems pointless > then perhaps ConditionVariablePrepareToSleep should become static and > implicit. This version does attempt to suppress spurious returns, a > bit, using proclist_contains. No more cvSleeping. Nice! 
> It's possible that future users will want a version with a timeout, or > multiplexed with IO, in which case there would be some interesting > questions about how this should interact with WaitEventSet. It also > seems like someone might eventually want to handle postmaster death. > Perhaps there should eventually be a way to tell WaitEventSet that > you're waiting for a CV so these things can be multiplexed without > exposing the fact that it's done internally with latches. Interesting. IMHO, returning on POSTMASTER_DEATH doesn't seem to harm ordinary use and might be useful right now. A CVSleepTimeout() could be added sooner or later if someone needs it. Multiplexed IO is apparently a matter for WaitEventSet, and waiting for a CV via WaitEventSet would be a matter for the future. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Nov 21, 2016 at 11:48 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Here's a version that works that way, though it allows you to call > ConditionVariablePrepareToSleep *optionally* before you enter your > loop, in case you expect to have to wait and would rather avoid the > extra loop. Maybe there isn't much point in exposing that though, > since your condition test should be fast and waiting is the slow path, > but we don't really really know what your condition test is. I > thought about that because my use case (barrier.c) does in fact expect > to hit the wait case more often than not. If that seems pointless > then perhaps ConditionVariablePrepareToSleep should become static and > implicit. This version does attempt to suppress spurious returns, a > bit, using proclist_contains. No more cvSleeping. This version looks good to me and I have committed it after doing a bit more work on the comments. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company