Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests - Mailing list pgsql-hackers
From | Robert Haas
---|---
Subject | Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
Date |
Msg-id | CA+TgmobAe7rR_KJKoGUt4jQsQujRrHNbJ=0Hymo380bbvn_1QA@mail.gmail.com
In response to | Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
List | pgsql-hackers
On Wed, Jun 14, 2017 at 3:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So the first problem here is the lack of supporting information for the
> 'could not map' failure.

Hmm.  I think I believed at the time I wrote dsm_attach() that somebody
might want to try to soldier on after failing to map a DSM, but that
doesn't seem very likely any more.  That theory is supported by this
comment:

    /*
     * If we didn't find the handle we're looking for in the control segment,
     * it probably means that everyone else who had it mapped, including the
     * original creator, died before we got to this point.  It's up to the
     * caller to decide what to do about that.
     */

But in practice, every current caller handles a failure here by bailing out.

> AFAICS, this must mean either that dsm_attach()
> returned without ever calling dsm_impl_op() at all, or that
> dsm_impl_op()'s Windows codepath encountered ERROR_ALREADY_EXISTS or
> ERROR_ACCESS_DENIED.  It's far from clear why those cases should be
> treated as a silent fail.

There's a good reason for that, though.  See
419113dfdc4c729f6c763cc30a9b02ee68a7da94.

> It's even less clear why dsm_attach's early
> exit cases don't produce any messages.  But since we're left not knowing
> what happened, the messaging design here is clearly inadequate.

I don't know what you mean by this.  The function only has one
early-exit case, the comment for which I quoted above.

> The subsequent assertion crash came from here:
>
>     /*
>      * If this is a parallel worker, check whether there are already too many
>      * parallel workers; if so, don't register another one.  Our view of
>      * parallel_terminate_count may be slightly stale, but that doesn't really
>      * matter: we would have gotten the same result if we'd arrived here
>      * slightly earlier anyway.  There's no help for it, either, since the
>      * postmaster must not take locks; a memory barrier wouldn't guarantee
>      * anything useful.
>      */
>     if (parallel && (BackgroundWorkerData->parallel_register_count -
>                      BackgroundWorkerData->parallel_terminate_count) >=
>         max_parallel_workers)
>     {
>         Assert(BackgroundWorkerData->parallel_register_count -
>                BackgroundWorkerData->parallel_terminate_count <=
>                MAX_PARALLEL_WORKER_LIMIT);
>         LWLockRelease(BackgroundWorkerLock);
>         return false;
>     }
>
> What I suspect happened is that parallel_register_count was less than
> parallel_terminate_count; since those counters are declared as uint32,
> the negative difference would have been treated as a large unsigned value,
> triggering the assertion.

Right, and that's exactly the point of having that assertion.

> It's not very clear how that happened, but my
> bet is that the postmaster incremented parallel_terminate_count more than
> once while cleaning up after the crashed worker.  It looks to me like
> there's nothing stopping BackgroundWorkerStateChange from incrementing it
> and then the eventual ForgetBackgroundWorker call from incrementing it
> again.  I haven't traced through things to identify why this might only
> occur in a worker-failure scenario, but surely we want to make sure that
> the counter increment happens once and only once per worker.

Yeah -- if that can happen, it's definitely a bug.

> I'm also noting that ForgetBackgroundWorker is lacking a memory barrier
> between incrementing parallel_terminate_count and marking the slot free.
> Maybe it doesn't need one, but since there is one in
> BackgroundWorkerStateChange, it looks weird to not have it here.

I noticed that the other day and thought about mentioning it, but
didn't quite muster up the energy.  It seems unlikely to me to be the
cause of any of the problems we're seeing, but I think it can't hurt
to fix the omission.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company