Re: Race conditions with checkpointer and shutdown - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Race conditions with checkpointer and shutdown
Date
Msg-id 1080.1555599180@sss.pgh.pa.us
Whole thread Raw
In response to Re: Race conditions with checkpointer and shutdown  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Race conditions with checkpointer and shutdown
Re: Race conditions with checkpointer and shutdown
List pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Apr 17, 2019 at 10:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think what we need to look for is reasons why (1) the postmaster
>> never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
>> main loop doesn't get to noticing shutdown_requested.
>> 
>> A rather scary point for (2) is that said main loop seems to be
>> assuming that MyLatch a/k/a MyProc->procLatch is not used for any
>> other purposes in the checkpointer process.  If there were something,
>> like say a condition variable wait, that would reset MyLatch at any
>> time during a checkpoint, then we could very easily go to sleep at the
>> bottom of the loop and not notice that there's a pending shutdown request.

> Agreed on the non-composability of that coding, but if there actually
> is anything in that loop that can reach ResetLatch(), it's well
> hidden...

Well, it's easy to see that there's no other ResetLatch call in
checkpointer.c.  It's much less obvious that there's no such
call anywhere in the code reachable from e.g. CreateCheckPoint().

To try to investigate that, I hacked things up to force an assertion
failure if ResetLatch was called from any other place in the
checkpointer process (dirty patch attached for amusement's sake).
This gets through check-world without any assertions.  That does not
really prove that there aren't corner timing cases where a latch
wait and reset could happen, but it does put a big dent in my theory.
Question is, what other theory has anybody got?

            regards, tom lane

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index a1e0423..b0ee1fd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -166,6 +166,8 @@ static double ckpt_cached_elapsed;
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;

+extern bool reset_latch_ok;
+
 /* Prototypes for private functions */

 static void CheckArchiveTimeout(void);
@@ -343,9 +345,13 @@ CheckpointerMain(void)
         int            elapsed_secs;
         int            cur_timeout;

+        reset_latch_ok = true;
+
         /* Clear any already-pending wakeups */
         ResetLatch(MyLatch);

+        reset_latch_ok = false;
+
         /*
          * Process any requests or signals received recently.
          */
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 59fa917..1f0613d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -118,6 +118,8 @@ struct WaitEventSet
 #endif
 };

+bool reset_latch_ok = true;
+
 #ifndef WIN32
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
@@ -521,6 +523,8 @@ ResetLatch(Latch *latch)
     /* Only the owner should reset the latch */
     Assert(latch->owner_pid == MyProcPid);

+    Assert(reset_latch_ok);
+
     latch->is_set = false;

     /*

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Question about the holdable cursor
Next
From: David Fetter
Date:
Subject: Re: block-level incremental backup