Re: Condition variable live lock - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Condition variable live lock
Date
Msg-id 26266.1515272424@sss.pgh.pa.us
Whole thread Raw
In response to Re: Condition variable live lock  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Condition variable live lock
List pgsql-hackers
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> No opinion without seeing what you propose to change.

> OK, will put out a proposal.

I began with the intention of making no non-cosmetic changes, but then
I started to wonder why ConditionVariablePrepareToSleep bothers with a
!proclist_contains test, when the calling process surely ought not be
in the list -- or any other list -- since it wasn't prepared to sleep.
And that led me down a different rabbit hole ending in the conclusion
that proclist.h could stand some improvements too.  I do not like the
fact that it's impossible to tell whether a proclist_node is in any
proclist or not.  Initially, a proclist_node contains zeroes which is
a distinguishable state, but proclist_delete_offset resets it to
next = prev = INVALID_PGPROCNO which looks the same as a node that's in a
singleton list.  We should have it reset to the initial state of zeroes
instead, and then we can add assertions to proclist_push_xxx that the
supplied node is not already in a list.  Hence, I propose the first
attached patch which tightens things up in proclist.h and then removes
the !proclist_contains test in ConditionVariablePrepareToSleep; the
assertion in proclist_push_tail supersedes that.

The second attached patch is the cosmetic changes I want to make in
condition_variable.c/.h.

I still think that we ought to change the Asserts on cv_sleep_target in
ConditionVariablePrepareToSleep and ConditionVariableSleep to be full
test-and-elog tests.  Those are checking a global correctness property
("global" meaning "interactions between completely unrelated modules can
break this"), and they'd be extremely cheap compared to the rest of what
those functions are doing, so I think insisting that they be Asserts is
penny wise and pound foolish.  Anybody besides Robert want to vote on
that?

Another loose end that I'm seeing here is that while a process waiting on
a condition variable will respond to a cancel or die interrupt, it will
not notice postmaster death.  This seems unwise to me.  I think we should
adjust the WaitLatch call to include WL_POSTMASTER_DEATH as a wake
condition and just do a summary proc_exit(1) if it sees that.  I'd even
argue that that is a back-patchable bug fix.

            regards, tom lane

diff --git a/src/include/storage/proclist_types.h b/src/include/storage/proclist_types.h
index 237fb76..f4dac10 100644
--- a/src/include/storage/proclist_types.h
+++ b/src/include/storage/proclist_types.h
@@ -16,7 +16,12 @@
 #define PROCLIST_TYPES_H

 /*
- * A node in a list of processes.
+ * A node in a doubly-linked list of processes.  The link fields contain
+ * the 0-based PGPROC indexes of the next and previous process, or
+ * INVALID_PGPROCNO in the next-link of the last node and the prev-link
+ * of the first node.  A node that is currently not in any list
+ * should have next == prev == 0; this is not a possible state for a node
+ * that is in a list, because we disallow circularity.
  */
 typedef struct proclist_node
 {
@@ -25,7 +30,8 @@ typedef struct proclist_node
 } proclist_node;

 /*
- * Head of a doubly-linked list of PGPROCs, identified by pgprocno.
+ * Header of a doubly-linked list of PGPROCs, identified by pgprocno.
+ * An empty list is represented by head == tail == INVALID_PGPROCNO.
  */
 typedef struct proclist_head
 {
@@ -42,4 +48,4 @@ typedef struct proclist_mutable_iter
     int            next;            /* pgprocno of the next PGPROC */
 } proclist_mutable_iter;

-#endif
+#endif                            /* PROCLIST_TYPES_H */
diff --git a/src/include/storage/proclist.h b/src/include/storage/proclist.h
index 4e25b47..59a478e 100644
--- a/src/include/storage/proclist.h
+++ b/src/include/storage/proclist.h
@@ -42,7 +42,7 @@ proclist_is_empty(proclist_head *list)

 /*
  * Get a pointer to a proclist_node inside a given PGPROC, given a procno and
- * an offset.
+ * the proclist_node field's offset within struct PGPROC.
  */
 static inline proclist_node *
 proclist_node_get(int procno, size_t node_offset)
@@ -53,13 +53,15 @@ proclist_node_get(int procno, size_t node_offset)
 }

 /*
- * Insert a node at the beginning of a list.
+ * Insert a process at the beginning of a list.
  */
 static inline void
 proclist_push_head_offset(proclist_head *list, int procno, size_t node_offset)
 {
     proclist_node *node = proclist_node_get(procno, node_offset);

+    Assert(node->next == 0 && node->prev == 0);
+
     if (list->head == INVALID_PGPROCNO)
     {
         Assert(list->tail == INVALID_PGPROCNO);
@@ -79,13 +81,15 @@ proclist_push_head_offset(proclist_head *list, int procno, size_t node_offset)
 }

 /*
- * Insert a node at the end of a list.
+ * Insert a process at the end of a list.
  */
 static inline void
 proclist_push_tail_offset(proclist_head *list, int procno, size_t node_offset)
 {
     proclist_node *node = proclist_node_get(procno, node_offset);

+    Assert(node->next == 0 && node->prev == 0);
+
     if (list->tail == INVALID_PGPROCNO)
     {
         Assert(list->head == INVALID_PGPROCNO);
@@ -105,30 +109,38 @@ proclist_push_tail_offset(proclist_head *list, int procno, size_t node_offset)
 }

 /*
- * Delete a node.  The node must be in the list.
+ * Delete a process from a list --- it must be in the list!
  */
 static inline void
 proclist_delete_offset(proclist_head *list, int procno, size_t node_offset)
 {
     proclist_node *node = proclist_node_get(procno, node_offset);

+    Assert(node->next != 0 || node->prev != 0);
+
     if (node->prev == INVALID_PGPROCNO)
+    {
+        Assert(list->head == procno);
         list->head = node->next;
+    }
     else
         proclist_node_get(node->prev, node_offset)->next = node->next;

     if (node->next == INVALID_PGPROCNO)
+    {
+        Assert(list->tail == procno);
         list->tail = node->prev;
+    }
     else
         proclist_node_get(node->next, node_offset)->prev = node->prev;

-    node->next = node->prev = INVALID_PGPROCNO;
+    node->next = node->prev = 0;
 }

 /*
- * Check if a node is currently in a list.  It must be known that the node is
- * not in any _other_ proclist that uses the same proclist_node, so that the
- * only possibilities are that it is in this list or none.
+ * Check if a process is currently in a list.  It must be known that the
+ * process is not in any _other_ proclist that uses the same proclist_node,
+ * so that the only possibilities are that it is in this list or none.
  */
 static inline bool
 proclist_contains_offset(proclist_head *list, int procno,
@@ -136,27 +148,26 @@ proclist_contains_offset(proclist_head *list, int procno,
 {
     proclist_node *node = proclist_node_get(procno, node_offset);

-    /*
-     * If this is not a member of a proclist, then the next and prev pointers
-     * should be 0. Circular lists are not allowed so this condition is not
-     * confusable with a real pgprocno 0.
-     */
+    /* If it's not in any list, it's definitely not in this one. */
     if (node->prev == 0 && node->next == 0)
         return false;

-    /* If there is a previous node, then this node must be in the list. */
-    if (node->prev != INVALID_PGPROCNO)
-        return true;
-
     /*
-     * There is no previous node, so the only way this node can be in the list
-     * is if it's the head node.
+     * It must, in fact, be in this list.  Ideally, in assert-enabled builds,
+     * we'd verify that.  But since this function is typically used while
+     * holding a spinlock, crawling the whole list is unacceptable.  However,
+     * we can verify matters in O(1) time when the node is a list head or
+     * tail, and that seems worth doing, since in practice that should often
+     * be enough to catch mistakes.
      */
-    return list->head == procno;
+    Assert(node->prev != INVALID_PGPROCNO || list->head == procno);
+    Assert(node->next != INVALID_PGPROCNO || list->tail == procno);
+
+    return true;
 }

 /*
- * Remove and return the first node from a list (there must be one).
+ * Remove and return the first process from a list (there must be one).
  */
 static inline PGPROC *
 proclist_pop_head_node_offset(proclist_head *list, size_t node_offset)
@@ -205,4 +216,4 @@ proclist_pop_head_node_offset(proclist_head *list, size_t node_offset)
              proclist_node_get((iter).cur,                                    \
                                offsetof(PGPROC, link_member))->next)

-#endif
+#endif                            /* PROCLIST_H */
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 60234db..e3bc034 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -86,8 +86,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)

     /* Add myself to the wait queue. */
     SpinLockAcquire(&cv->mutex);
-    if (!proclist_contains(&cv->wakeup, pgprocno, cvWaitLink))
-        proclist_push_tail(&cv->wakeup, pgprocno, cvWaitLink);
+    proclist_push_tail(&cv->wakeup, pgprocno, cvWaitLink);
     SpinLockRelease(&cv->mutex);
 }

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index e3bc034..0b9d676 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -43,11 +43,17 @@ ConditionVariableInit(ConditionVariable *cv)
 }

 /*
- * Prepare to wait on a given condition variable.  This can optionally be
- * called before entering a test/sleep loop.  Alternatively, the call to
- * ConditionVariablePrepareToSleep can be omitted.  The only advantage of
- * calling ConditionVariablePrepareToSleep is that it avoids an initial
- * double-test of the user's predicate in the case that we need to wait.
+ * Prepare to wait on a given condition variable.
+ *
+ * This can optionally be called before entering a test/sleep loop.
+ * Doing so is more efficient if we'll need to sleep at least once.
+ * However, if the first test of the exit condition is likely to succeed,
+ * it's more efficient to omit the ConditionVariablePrepareToSleep call.
+ * See comments in ConditionVariableSleep for more detail.
+ *
+ * Only one condition variable can be used at a time, ie,
+ * ConditionVariableCancelSleep must be called before any attempt is made
+ * to sleep on a different condition variable.
  */
 void
 ConditionVariablePrepareToSleep(ConditionVariable *cv)
@@ -79,8 +85,8 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     cv_sleep_target = cv;

     /*
-     * Reset my latch before adding myself to the queue and before entering
-     * the caller's predicate loop.
+     * Reset my latch before adding myself to the queue, to ensure that we
+     * don't miss a wakeup that occurs immediately.
      */
     ResetLatch(MyLatch);

@@ -90,20 +96,21 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     SpinLockRelease(&cv->mutex);
 }

-/*--------------------------------------------------------------------------
- * Wait for the given condition variable to be signaled.  This should be
- * called in a predicate loop that tests for a specific exit condition and
- * otherwise sleeps, like so:
+/*
+ * Wait for the given condition variable to be signaled.
  *
- *     ConditionVariablePrepareToSleep(cv); [optional]
+ * This should be called in a predicate loop that tests for a specific exit
+ * condition and otherwise sleeps, like so:
+ *
+ *     ConditionVariablePrepareToSleep(cv);  // optional
  *     while (condition for which we are waiting is not true)
  *         ConditionVariableSleep(cv, wait_event_info);
  *     ConditionVariableCancelSleep();
  *
- * Supply a value from one of the WaitEventXXX enums defined in pgstat.h to
- * control the contents of pg_stat_activity's wait_event_type and wait_event
- * columns while waiting.
- *-------------------------------------------------------------------------*/
+ * wait_event_info should be a value from one of the WaitEventXXX enums
+ * defined in pgstat.h.  This controls the contents of pg_stat_activity's
+ * wait_event_type and wait_event columns while waiting.
+ */
 void
 ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
 {
@@ -113,13 +120,14 @@ ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
     /*
      * If the caller didn't prepare to sleep explicitly, then do so now and
      * return immediately.  The caller's predicate loop should immediately
-     * call again if its exit condition is not yet met.  This initial spurious
-     * return can be avoided by calling ConditionVariablePrepareToSleep(cv)
+     * call again if its exit condition is not yet met.  This will result in
+     * the exit condition being tested twice before we first sleep.  The extra
+     * test can be prevented by calling ConditionVariablePrepareToSleep(cv)
      * first.  Whether it's worth doing that depends on whether you expect the
-     * condition to be met initially, in which case skipping the prepare
-     * allows you to skip manipulation of the wait list, or not met initially,
-     * in which case preparing first allows you to skip a spurious test of the
-     * caller's exit condition.
+     * condition to be met initially, in which case skipping the prepare is
+     * recommended because it avoids manipulations of the wait list, or not
+     * met initially, in which case preparing first is better because it
+     * avoids one extra test of the exit condition.
      */
     if (cv_sleep_target == NULL)
     {
@@ -130,7 +138,7 @@ ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
     /* Any earlier condition variable sleep must have been canceled. */
     Assert(cv_sleep_target == cv);

-    while (!done)
+    do
     {
         CHECK_FOR_INTERRUPTS();

@@ -140,18 +148,23 @@ ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
          */
         WaitEventSetWait(cv_wait_event_set, -1, &event, 1, wait_event_info);

-        /* Reset latch before testing whether we can return. */
+        /* Reset latch before examining the state of the wait list. */
         ResetLatch(MyLatch);

         /*
          * If this process has been taken out of the wait list, then we know
-         * that is has been signaled by ConditionVariableSignal.  We put it
-         * back into the wait list, so we don't miss any further signals while
-         * the caller's loop checks its condition.  If it hasn't been taken
-         * out of the wait list, then the latch must have been set by
-         * something other than ConditionVariableSignal; though we don't
-         * guarantee not to return spuriously, we'll avoid these obvious
-         * cases.
+         * that it has been signaled by ConditionVariableSignal (or
+         * ConditionVariableBroadcast), so we should return to the caller. But
+         * that doesn't guarantee that the exit condition is met, only that we
+         * ought to check it.  So we must put the process back into the wait
+         * list, to ensure we don't miss any additional wakeup occurring while
+         * the caller checks its exit condition.  We can take ourselves out of
+         * the wait list only when the caller calls
+         * ConditionVariableCancelSleep.
+         *
+         * If we're still in the wait list, then the latch must have been set
+         * by something other than ConditionVariableSignal; though we don't
+         * guarantee not to return spuriously, we'll avoid this obvious case.
          */
         SpinLockAcquire(&cv->mutex);
         if (!proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
@@ -160,13 +173,17 @@ ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
             proclist_push_tail(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
         }
         SpinLockRelease(&cv->mutex);
-    }
+    } while (!done);
 }

 /*
- * Cancel any pending sleep operation.  We just need to remove ourselves
- * from the wait queue of any condition variable for which we have previously
- * prepared a sleep.
+ * Cancel any pending sleep operation.
+ *
+ * We just need to remove ourselves from the wait queue of any condition
+ * variable for which we have previously prepared a sleep.
+ *
+ * Do nothing if nothing is pending; this allows this function to be called
+ * during transaction abort to clean up any unfinished CV sleep.
  */
 void
 ConditionVariableCancelSleep(void)
diff --git a/src/include/storage/condition_variable.h b/src/include/storage/condition_variable.h
index c7afbbc..7dac477 100644
--- a/src/include/storage/condition_variable.h
+++ b/src/include/storage/condition_variable.h
@@ -27,30 +27,33 @@

 typedef struct
 {
-    slock_t        mutex;
-    proclist_head wakeup;
+    slock_t        mutex;            /* spinlock protecting the wakeup list */
+    proclist_head wakeup;        /* list of wake-able processes */
 } ConditionVariable;

 /* Initialize a condition variable. */
-extern void ConditionVariableInit(ConditionVariable *);
+extern void ConditionVariableInit(ConditionVariable *cv);

 /*
  * To sleep on a condition variable, a process should use a loop which first
  * checks the condition, exiting the loop if it is met, and then calls
  * ConditionVariableSleep.  Spurious wakeups are possible, but should be
- * infrequent.  After exiting the loop, ConditionVariableCancelSleep should
+ * infrequent.  After exiting the loop, ConditionVariableCancelSleep must
  * be called to ensure that the process is no longer in the wait list for
- * the condition variable.
+ * the condition variable.  Only one condition variable can be used at a
+ * time, ie, ConditionVariableCancelSleep must be called before any attempt
+ * is made to sleep on a different condition variable.
  */
-extern void ConditionVariableSleep(ConditionVariable *, uint32 wait_event_info);
+extern void ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info);
 extern void ConditionVariableCancelSleep(void);

 /*
- * The use of this function is optional and not necessary for correctness;
- * for efficiency, it should be called prior entering the loop described above
- * if it is thought that the condition is unlikely to hold immediately.
+ * Optionally, ConditionVariablePrepareToSleep can be called before entering
+ * the test-and-sleep loop described above.  Doing so is more efficient if
+ * at least one sleep is needed, whereas not doing so is more efficient when
+ * no sleep is needed because the test condition is true the first time.
  */
-extern void ConditionVariablePrepareToSleep(ConditionVariable *);
+extern void ConditionVariablePrepareToSleep(ConditionVariable *cv);

 /* Wake up a single waiter (via signal) or all waiters (via broadcast). */
 extern void ConditionVariableSignal(ConditionVariable *cv);

pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: [HACKERS] [PROPOSAL] Temporal query processing with range types
Next
From: Stephen Frost
Date:
Subject: Re: Add default role 'pg_access_server_files'