Re: Race conditions with checkpointer and shutdown - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Race conditions with checkpointer and shutdown |
Date | |
Msg-id | 1080.1555599180@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Race conditions with checkpointer and shutdown (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: Race conditions with checkpointer and shutdown
Re: Race conditions with checkpointer and shutdown |
List | pgsql-hackers |
Thomas Munro <thomas.munro@gmail.com> writes: > On Wed, Apr 17, 2019 at 10:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think what we need to look for is reasons why (1) the postmaster >> never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's >> main loop doesn't get to noticing shutdown_requested. >> >> A rather scary point for (2) is that said main loop seems to be >> assuming that MyLatch a/k/a MyProc->procLatch is not used for any >> other purposes in the checkpointer process. If there were something, >> like say a condition variable wait, that would reset MyLatch at any >> time during a checkpoint, then we could very easily go to sleep at the >> bottom of the loop and not notice that there's a pending shutdown request. > Agreed on the non-composability of that coding, but if there actually > is anything in that loop that can reach ResetLatch(), it's well > hidden... Well, it's easy to see that there's no other ResetLatch call in checkpointer.c. It's much less obvious that there's no such call anywhere in the code reachable from e.g. CreateCheckPoint(). To try to investigate that, I hacked things up to force an assertion failure if ResetLatch was called from any other place in the checkpointer process (dirty patch attached for amusement's sake). This gets through check-world without any assertions. That does not really prove that there aren't corner timing cases where a latch wait and reset could happen, but it does put a big dent in my theory. Question is, what other theory has anybody got? regards, tom lane diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c index a1e0423..b0ee1fd 100644 --- a/src/backend/postmaster/checkpointer.c +++ b/src/backend/postmaster/checkpointer.c @@ -166,6 +166,8 @@ static double ckpt_cached_elapsed; static pg_time_t last_checkpoint_time; static pg_time_t last_xlog_switch_time; +extern bool reset_latch_ok; + /* Prototypes for private functions */ static void CheckArchiveTimeout(void); @@ -343,9 +345,13 @@ CheckpointerMain(void) int elapsed_secs; int cur_timeout; + reset_latch_ok = true; + /* Clear any already-pending wakeups */ ResetLatch(MyLatch); + reset_latch_ok = false; + /* * Process any requests or signals received recently. */ diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 59fa917..1f0613d 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -118,6 +118,8 @@ struct WaitEventSet #endif }; +bool reset_latch_ok = true; + #ifndef WIN32 /* Are we currently in WaitLatch? The signal handler would like to know. */ static volatile sig_atomic_t waiting = false; @@ -521,6 +523,8 @@ ResetLatch(Latch *latch) /* Only the owner should reset the latch */ Assert(latch->owner_pid == MyProcPid); + Assert(reset_latch_ok); + latch->is_set = false; /*
pgsql-hackers by date: