Re: Missed check for too-many-children in bgworker spawning - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Missed check for too-many-children in bgworker spawning
Date
Msg-id 20191104190440.7p5jg2ne76l6islt@alap3.anarazel.de
In response to Re: Missed check for too-many-children in bgworker spawning  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Missed check for too-many-children in bgworker spawning  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

On 2019-11-04 12:14:53 -0500, Robert Haas wrote:
> If a process trying to register workers finds out that no worker slots
> are available, it discovers this at the time it tries to perform the
> registration. But fork() failure happens later and in a different
> process. The original process just finds out that the worker is
> "stopped," not whether or not it ever got started in the first
> place.

Is that really true? In the case where it started and failed, we expect
the error queue to have been attached to, and there to be either an
error 'E' or an 'X' response (cf. HandleParallelMessage()).  It doesn't
strike me as very complicated to keep track of whether any worker has
sent an 'E' or not, no?  I don't think we really need the

Funny (?) anecdote: I learned about this part of the system recently,
after I had installed a crash handler inside postgres. It turned out
that, as a side effect, it diverted SIGUSR1 to its own signal handler.
All tests in the main regression suite passed, except for ones getting
stuck waiting in WaitForParallelWorkersToFinish(), which could be fixed
by aggressively disabling parallelism. Took me like two hours to
debug... Also, a bit sad that parallel query is the only visible
failure (in the main tests) when the SIGUSR1 infrastructure is broken...


> We certainly can't ignore a worker that managed to start and
> then bombed out, because it might've already, for example, claimed a
> block from a Parallel Seq Scan and not yet sent back the corresponding
> tuples. We could ignore a worker that never started at all, due to
> EAGAIN or whatever else, but the original process that registered the
> worker has no way of finding this out.

Sure, but in that case we'd have gotten either an error back from the
worker, or the postmaster would have PANIC-restarted everyone due to an
unhandled error in the worker, no?


> And even if you solved for all of that, I think you might still find
> that it breaks some parallel query (or parallel create index) code
> that expects the number of workers to change at registration time, but
> not afterwards. So, that code would all need to be adjusted.

Fair enough. Although I think practically nearly everything has to be
ready to handle workers just being slow to start up anyway, no? There
are plenty of cases where we simply finish before all workers get
around to doing work.

Greetings,

Andres Freund


