Thread: Missed check for too-many-children in bgworker spawning
Over in [1] we have a report of a postmaster shutdown that seems to have occurred because some client logic was overaggressively spawning connection requests, causing the postmaster's child-process arrays to be temporarily full, and then some parallel query tried to launch a new bgworker process. The postmaster's bgworker-spawning logic lacks any check for the arrays being full, so when AssignPostmasterChildSlot failed to find a free slot, kaboom!

The attached proposed patch fixes this by making bgworker spawning include a canAcceptConnections() test. That's perhaps overkill, since we really just need to check the CountChildren() total; but I judged that going through that function and having it decide what to test or not test was a better design than duplicating the CountChildren() test elsewhere.

I'd first imagined also replacing the one-size-fits-all check

    if (CountChildren(BACKEND_TYPE_ALL) >= MaxLivePostmasterChildren())
        result = CAC_TOOMANY;

with something like

    switch (backend_type)
    {
        case BACKEND_TYPE_NORMAL:
            if (CountChildren(backend_type) >= 2 * MaxConnections)
                result = CAC_TOOMANY;
            break;
        case BACKEND_TYPE_AUTOVAC:
            if (CountChildren(backend_type) >= 2 * autovacuum_max_workers)
                result = CAC_TOOMANY;
            break;
        ...
    }

so as to subdivide the pool of child-process slots and prevent client requests from consuming slots meant for background processes. But on closer examination that's not really worth the trouble, because this pool is already considerably bigger than MaxBackends; so even if we prevented a failure here we could still have bgworker startup failure later on when it tries to acquire a PGPROC.

Barring objections, I'll apply and back-patch this soon.

			regards, tom lane

[1] https://www.postgresql.org/message-id/flat/CADCf-WNZk_9680Q0YjfBzuiR0Oe8LzvDs2Ts3_tq6Tv1e8raQQ%40mail.gmail.com

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index eb9e022..7947c96 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -409,7 +409,7 @@ static void SendNegotiateProtocolVersion(List *unrecognized_protocol_options);
 static void processCancelRequest(Port *port, void *pkt);
 static int	initMasks(fd_set *rmask);
 static void report_fork_failure_to_client(Port *port, int errnum);
-static CAC_state canAcceptConnections(void);
+static CAC_state canAcceptConnections(int backend_type);
 static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
@@ -2401,13 +2401,15 @@ processCancelRequest(Port *port, void *pkt)
  * canAcceptConnections --- check to see if database state allows connections.
  */
 static CAC_state
-canAcceptConnections(void)
+canAcceptConnections(int backend_type)
 {
 	CAC_state	result = CAC_OK;
 
 	/*
 	 * Can't start backends when in startup/shutdown/inconsistent recovery
-	 * state.
+	 * state.  We treat autovac workers the same as user backends for this
+	 * purpose.  However, bgworkers are excluded from this test; we expect
+	 * bgworker_should_start_now() decided whether the DB state allows them.
 	 *
 	 * In state PM_WAIT_BACKUP only superusers can connect (this must be
 	 * allowed so that a superuser can end online backup mode); we return
@@ -2415,7 +2417,8 @@ canAcceptConnections(void)
 	 * that neither CAC_OK nor CAC_WAITBACKUP can safely be returned until we
 	 * have checked for too many children.
 	 */
-	if (pmState != PM_RUN)
+	if (pmState != PM_RUN &&
+		backend_type != BACKEND_TYPE_BGWORKER)
 	{
 		if (pmState == PM_WAIT_BACKUP)
 			result = CAC_WAITBACKUP;	/* allow superusers only */
@@ -2435,9 +2438,9 @@ canAcceptConnections(void)
 	/*
 	 * Don't start too many children.
 	 *
-	 * We allow more connections than we can have backends here because some
+	 * We allow more connections here than we can have backends because some
 	 * might still be authenticating; they might fail auth, or some existing
-	 * backend might exit before the auth cycle is completed. The exact
+	 * backend might exit before the auth cycle is completed.  The exact
 	 * MaxBackends limit is enforced when a new backend tries to join the
 	 * shared-inval backend array.
 	 *
@@ -4114,7 +4117,7 @@ BackendStartup(Port *port)
 	bn->cancel_key = MyCancelKey;
 
 	/* Pass down canAcceptConnections state */
-	port->canAcceptConnections = canAcceptConnections();
+	port->canAcceptConnections = canAcceptConnections(BACKEND_TYPE_NORMAL);
 	bn->dead_end = (port->canAcceptConnections != CAC_OK &&
 					port->canAcceptConnections != CAC_WAITBACKUP);
 
@@ -5486,7 +5489,7 @@ StartAutovacuumWorker(void)
 	 * we have to check to avoid race-condition problems during DB state
 	 * changes.
 	 */
-	if (canAcceptConnections() == CAC_OK)
+	if (canAcceptConnections(BACKEND_TYPE_AUTOVAC) == CAC_OK)
 	{
 		/*
 		 * Compute the cancel key that will be assigned to this session. We
@@ -5863,6 +5866,19 @@ assign_backendlist_entry(RegisteredBgWorker *rw)
 	Backend    *bn;
 
 	/*
+	 * Check that database state allows another connection.  Currently the
+	 * only possible failure is CAC_TOOMANY, so we just log an error message
+	 * based on that rather than checking the error code precisely.
+	 */
+	if (canAcceptConnections(BACKEND_TYPE_BGWORKER) != CAC_OK)
+	{
+		ereport(LOG,
+				(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+				 errmsg("no slot available for new worker process")));
+		return false;
+	}
+
+	/*
 	 * Compute the cancel key that will be assigned to this session. We
 	 * probably don't need cancel keys for background workers, but we'd better
 	 * have something random in the field to prevent unfriendly people from
On Sun, Oct 6, 2019 at 1:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Over in [1] we have a report of a postmaster shutdown that seems to have occurred because some client logic was overaggressively spawning connection requests, causing the postmaster's child-process arrays to be temporarily full, and then some parallel query tried to launch a new bgworker process. The postmaster's bgworker-spawning logic lacks any check for the arrays being full, so when AssignPostmasterChildSlot failed to find a free slot, kaboom!
>
> The attached proposed patch fixes this by making bgworker spawning include a canAcceptConnections() test. That's perhaps overkill, since we really just need to check the CountChildren() total; but I judged that going through that function and having it decide what to test or not test was a better design than duplicating the CountChildren() test elsewhere.
>
> I'd first imagined also replacing the one-size-fits-all check
>
>     if (CountChildren(BACKEND_TYPE_ALL) >= MaxLivePostmasterChildren())
>         result = CAC_TOOMANY;
>
> with something like
>
>     switch (backend_type)
>     {
>         case BACKEND_TYPE_NORMAL:
>             if (CountChildren(backend_type) >= 2 * MaxConnections)
>                 result = CAC_TOOMANY;
>             break;
>         case BACKEND_TYPE_AUTOVAC:
>             if (CountChildren(backend_type) >= 2 * autovacuum_max_workers)
>                 result = CAC_TOOMANY;
>             break;
>         ...
>     }
>
> so as to subdivide the pool of child-process slots and prevent client requests from consuming slots meant for background processes. But on closer examination that's not really worth the trouble, because this pool is already considerably bigger than MaxBackends; so even if we prevented a failure here we could still have bgworker startup failure later on when it tries to acquire a PGPROC.
>
> Barring objections, I'll apply and back-patch this soon.

I think it used to work this way -- not sure if it was ever committed this way, but it at least did during development -- and we ripped it out because somebody (Magnus?) pointed out that if you got close to the connection limit, you could see parallel queries start failing, and that would suck. Falling back to non-parallel seems more OK in that situation than actually failing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Oct 6, 2019 at 1:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The attached proposed patch fixes this by making bgworker spawning >> include a canAcceptConnections() test. > I think it used to work this way -- not sure if it was ever committed > this way, but it at least did during development -- and we ripped it > out because somebody (Magnus?) pointed out that if you got close to > the connection limit, you could see parallel queries start failing, > and that would suck. Falling back to non-parallel seems more OK in > that situation than actually failing. I'm not following your point? Whatever you might think the appropriate response is, I'm pretty sure "elog(FATAL) out of the postmaster" is not it. Moreover, we have to --- and already do, I trust --- deal with other resource-exhaustion errors in exactly the same code path, notably fork(2) failure which we simply can't predict or prevent. Doesn't the parallel query logic already deal sanely with failure to obtain as many workers as it wanted? regards, tom lane
On Mon, Oct 7, 2019 at 4:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm not following your point? Whatever you might think the appropriate > response is, I'm pretty sure "elog(FATAL) out of the postmaster" is not > it. Moreover, we have to --- and already do, I trust --- deal with > other resource-exhaustion errors in exactly the same code path, notably > fork(2) failure which we simply can't predict or prevent. Doesn't the > parallel query logic already deal sanely with failure to obtain as many > workers as it wanted? If we fail to obtain workers because there are not adequate workers slots available, parallel query deals with that smoothly. But, once we have a slot, any further failure will trigger the parallel query to ERROR out. For the case where we get a slot but can't start the worker process, see WaitForParallelWorkersToFinish and/or WaitForParallelWorkersToAttach and comments therein. Once we're attached, any error messages thrown by the worker are propagated back to the master; see HandleParallelMessages and pq_redirect_to_shm_mq. Now you could argue that the master ought to selectively ignore certain kinds of errors and just continue on, while rethrowing others, say based on the errcode(). Such design ideas have been roundly panned in other contexts, though, so I'm not sure it would be a great idea to do it here either. But in any case, it's not how the current system behaves, or was designed to behave. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
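To make the contrast concrete: the graceful path Robert describes happens at registration time, inside the worker-launch loop in parallel.c. Roughly, and from memory rather than the exact source (a simplified sketch; worker setup and error-queue bookkeeping omitted):

	/* Simplified sketch of the registration loop in LaunchParallelWorkers() */
	for (i = 0; i < pcxt->nworkers; ++i)
	{
		if (RegisterDynamicBackgroundWorker(&worker,
											&pcxt->worker[i].bgwhandle))
		{
			/* Got a bgworker slot; the worker might still fail to fork later. */
			pcxt->nworkers_launched++;
		}
		else
		{
			/* No slot available: silently plan on running with fewer workers. */
			pcxt->worker[i].bgwhandle = NULL;
			pcxt->worker[i].error_mqh = NULL;
		}
	}

A fork() failure, by contrast, only becomes visible later, in WaitForParallelWorkersToFinish()/WaitForParallelWorkersToAttach(), after the slot has already been counted in nworkers_launched.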
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Oct 7, 2019 at 4:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ... Moreover, we have to --- and already do, I trust --- deal with >> other resource-exhaustion errors in exactly the same code path, notably >> fork(2) failure which we simply can't predict or prevent. Doesn't the >> parallel query logic already deal sanely with failure to obtain as many >> workers as it wanted? > If we fail to obtain workers because there are not adequate workers > slots available, parallel query deals with that smoothly. But, once > we have a slot, any further failure will trigger the parallel query to > ERROR out. Well, that means we have a not-very-stable system then. We could improve on matters so far as the postmaster's child-process arrays are concerned, by defining separate slot "pools" for the different types of child processes. But I don't see much point if the code is not prepared to recover from a fork() failure --- and if it is, that would a fortiori deal with out-of-child-slots as well. regards, tom lane
On Wed, Oct 9, 2019 at 10:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Well, that means we have a not-very-stable system then. > > We could improve on matters so far as the postmaster's child-process > arrays are concerned, by defining separate slot "pools" for the different > types of child processes. But I don't see much point if the code is > not prepared to recover from a fork() failure --- and if it is, that > would a fortiori deal with out-of-child-slots as well. I would say rather that if fork() is failing on your system, you have a not very stable system. The fact that parallel query is going to fail is sad, but not as sad as the fact that connecting to the database is also going to fail, and that logging into the system to try to fix the problem may well fail as well. Code that tries to make parallel query cope with this situation without an error wouldn't often be tested, so it might be buggy, and it wouldn't necessarily be a benefit if it did work. I expect many people would rather have the query fail and free up slots in the system process table than consume precisely all of them and then try to execute the query at a slower-than-expected rate. Anyway, here's some previous discussion on this topic for your consideration: https://www.postgresql.org/message-id/flat/CAKJS1f_6H2Gh3QyORyRP%2BG3YB3gZiNms_8QdtO5gvitfY5N9ig%40mail.gmail.com -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Oct 9, 2019 at 10:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We could improve on matters so far as the postmaster's child-process arrays are concerned, by defining separate slot "pools" for the different types of child processes. But I don't see much point if the code is not prepared to recover from a fork() failure --- and if it is, that would a fortiori deal with out-of-child-slots as well.

> I would say rather that if fork() is failing on your system, you have a not very stable system. The fact that parallel query is going to fail is sad, but not as sad as the fact that connecting to the database is also going to fail, and that logging into the system to try to fix the problem may well fail as well.

True, it's not a situation you especially want to be in. However, I've lost count of the number of times that I've heard someone talk about how their system was overstressed to the point that everything else was failing, but Postgres kept chugging along. That's a good reputation to have and we shouldn't just walk away from it.

> Code that tries to make parallel query cope with this situation without an error wouldn't often be tested, so it might be buggy, and it wouldn't necessarily be a benefit if it did work. I expect many people would rather have the query fail and free up slots in the system process table than consume precisely all of them and then try to execute the query at a slower-than-expected rate.

I find that argument to be utter bunkum. The parallel query code is *already* designed to silently degrade performance when its primary resource limit (shared bgworker slots) is exhausted. How can it be all right to do that but not all right to cope with fork failure similarly? If we think running up against the kernel limits is a case that we can roll over and die on, why don't we rip out the virtual-FD stuff in fd.c?

As for "might be buggy", if we ripped out every part of Postgres that's under-tested, I'm afraid there might not be much left. In any case, a sane design for this would make as much as possible of the code handle "out of shared bgworker slots" just the same as resource failures later on, so that there wouldn't be that big a gap in coverage.

Having said all that, I made a patch that causes the postmaster to reserve separate child-process-array slots for autovac workers and bgworkers, as per attached, so that excessive connection requests can't DOS those subsystems. But I'm not sure that it's worth the complication; it wouldn't be necessary if the parallel query launch code were more robust.

			regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 85f15a5..8c66578 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2441,17 +2441,43 @@ canAcceptConnections(int backend_type)
 	/*
 	 * Don't start too many children.
 	 *
-	 * We allow more connections here than we can have backends because some
-	 * might still be authenticating; they might fail auth, or some existing
-	 * backend might exit before the auth cycle is completed.  The exact
-	 * MaxBackends limit is enforced when a new backend tries to join the
-	 * shared-inval backend array.
+	 * We allow more connections (including walsenders) here than we can have
+	 * backends because some might still be authenticating; they might fail
+	 * auth, or some existing backend might exit before the auth cycle is
+	 * completed.  The exact MaxBackends limit is enforced when a new backend
+	 * tries to join the shared-inval backend array.
 	 *
-	 * The limit here must match the sizes of the per-child-process arrays;
-	 * see comments for MaxLivePostmasterChildren().
+	 * On the other hand, there's no need to allow extra autovac workers or
+	 * bgworkers, because the requesting logic for those should never request
+	 * more processes than the corresponding limit.
+	 *
+	 * This logic must agree with MaxLivePostmasterChildren(), which sets the
+	 * total size of the per-child-process arrays.
+	 *
+	 * Note: currently, we can't see backend_type == BACKEND_TYPE_WALSND here,
+	 * but it seems less weird to include it in the switch than not.  We must
+	 * treat normal backends and walsenders alike for this purpose because one
+	 * type can metamorphose into the other later.
 	 */
-	if (CountChildren(BACKEND_TYPE_ALL) >= MaxLivePostmasterChildren())
-		result = CAC_TOOMANY;
+	switch (backend_type)
+	{
+		case BACKEND_TYPE_NORMAL:
+		case BACKEND_TYPE_WALSND:
+			if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WALSND) >=
+				2 * (MaxConnections + max_wal_senders))
+				result = CAC_TOOMANY;
+			break;
+		case BACKEND_TYPE_AUTOVAC:
+			if (CountChildren(BACKEND_TYPE_AUTOVAC) >= autovacuum_max_workers)
+				result = CAC_TOOMANY;
+			break;
+		case BACKEND_TYPE_BGWORKER:
+			if (CountChildren(BACKEND_TYPE_BGWORKER) >= max_worker_processes)
+				result = CAC_TOOMANY;
+			break;
+		default:
+			elog(ERROR, "unexpected backend_type: %d", backend_type);
+	}
 
 	return result;
 }
 
@@ -5334,8 +5360,18 @@ static int
 CountChildren(int target)
 {
 	dlist_iter	iter;
+	bool		check_walsender;
 	int			cnt = 0;
 
+	/*
+	 * In common cases we don't need to distinguish normal from walsender
+	 * children, so we can save a few cycles and shared-memory touches by not
+	 * checking for walsenders if not.
+	 */
+	check_walsender =
+		(target & (BACKEND_TYPE_NORMAL | BACKEND_TYPE_WALSND)) != 0 &&
+		(target & (BACKEND_TYPE_NORMAL | BACKEND_TYPE_WALSND)) != (BACKEND_TYPE_NORMAL | BACKEND_TYPE_WALSND);
+
 	dlist_foreach(iter, &BackendList)
 	{
 		Backend    *bp = dlist_container(Backend, elem, iter.cur);
@@ -5343,11 +5379,7 @@ CountChildren(int target)
 		if (bp->dead_end)
 			continue;
 
-		/*
-		 * Since target == BACKEND_TYPE_ALL is the most common case, we test
-		 * it first and avoid touching shared memory for every child.
-		 */
-		if (target != BACKEND_TYPE_ALL)
+		if (check_walsender)
 		{
 			/*
 			 * Assign bkend_type for any recently announced WAL Sender
 			 *
@@ -5356,11 +5388,11 @@ CountChildren(int target)
 			if (bp->bkend_type == BACKEND_TYPE_NORMAL &&
 				IsPostmasterChildWalSender(bp->child_slot))
 				bp->bkend_type = BACKEND_TYPE_WALSND;
-
-			if (!(target & bp->bkend_type))
-				continue;
 		}
 
+		if (!(target & bp->bkend_type))
+			continue;
+
 		cnt++;
 	}
 	return cnt;
 }
@@ -5626,15 +5658,31 @@ CreateOptsFile(int argc, char *argv[], char *fullprogname)
  * (the PMChildFlags array, and if EXEC_BACKEND the ShmemBackendArray).
  * These arrays include regular backends, autovac workers, walsenders
  * and background workers, but not special children nor dead_end children.
- * This allows the arrays to have a fixed maximum size, to wit the same
- * too-many-children limit enforced by canAcceptConnections().  The exact value
- * isn't too critical as long as it's more than MaxBackends.
+ * This allows the arrays to have a fixed maximum size.  The total amount
+ * of space we reserve here mustn't be less than the number of child processes
+ * that canAcceptConnections() will allow.
 */
 int
 MaxLivePostmasterChildren(void)
 {
-	return 2 * (MaxConnections + autovacuum_max_workers + 1 +
-				max_wal_senders + max_worker_processes);
+	int			nchildren = 0;
+
+	/*
+	 * For children that are launched from asynchronous connection requests,
+	 * allow some extra space in the arrays so that we don't deny a connection
+	 * that would have succeeded by the time it is ready to join the shared-
+	 * memory data structures.  The 2X multiplier is arbitrary, and probably
+	 * overkill, but the amount of space wasted is minimal.
+	 */
+	nchildren += 2 * (MaxConnections + max_wal_senders);
+
+	/* Allow for the autovac launcher and its workers */
+	nchildren += 1 + autovacuum_max_workers;
+
+	/* Allow for bgworkers */
+	nchildren += max_worker_processes;
+
+	return nchildren;
 }
 
 /*
On 2019-Oct-09, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Wed, Oct 9, 2019 at 10:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> We could improve on matters so far as the postmaster's child-process > >> arrays are concerned, by defining separate slot "pools" for the different > >> types of child processes. But I don't see much point if the code is > >> not prepared to recover from a fork() failure --- and if it is, that > >> would a fortiori deal with out-of-child-slots as well. > > > I would say rather that if fork() is failing on your system, you have > > a not very stable system. The fact that parallel query is going to > > fail is sad, but not as sad as the fact that connecting to the > > database is also going to fail, and that logging into the system to > > try to fix the problem may well fail as well. > > True, it's not a situation you especially want to be in. However, > I've lost count of the number of times that I've heard someone talk > about how their system was overstressed to the point that everything > else was failing, but Postgres kept chugging along. That's a good > reputation to have and we shouldn't just walk away from it. I agree with this point in principle. Everything else (queries, checkpointing) can fail, but it's critical that postmaster continues to run -- that way, once the high load episode is over, connections can be re-established as needed, auxiliary processes can be re-launched, and the system can be again working normally. If postmaster dies, all bets are off. Also: an idle postmaster is not using any resources; on its own, killing it or it dying would not free any useful resources for the system load to be back to low again. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 4, 2019 at 10:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > True, it's not a situation you especially want to be in. However, I've lost count of the number of times that I've heard someone talk about how their system was overstressed to the point that everything else was failing, but Postgres kept chugging along. That's a good reputation to have and we shouldn't just walk away from it.
>
> I agree with this point in principle. Everything else (queries, checkpointing) can fail, but it's critical that postmaster continues to run -- that way, once the high load episode is over, connections can be re-established as needed, auxiliary processes can be re-launched, and the system can be again working normally. If postmaster dies, all bets are off. Also: an idle postmaster is not using any resources; on its own, killing it or it dying would not free any useful resources for the system load to be back to low again.

Sure, I'm not arguing that the postmaster should blow up and die. I was, however, arguing that if the postmaster fails to launch workers for a parallel query due to process table exhaustion, it's OK for *that query* to error out. Tom finds that argument to be "utter bunkum," but I don't agree.

I think there might also be some implementation complexity there that is more than meets the eye. If a process trying to register workers finds out that no worker slots are available, it discovers this at the time it tries to perform the registration. But fork() failure happens later and in a different process. The original process just finds out that the worker is "stopped," not whether or not it ever got started in the first place.

We certainly can't ignore a worker that managed to start and then bombed out, because it might've already, for example, claimed a block from a Parallel Seq Scan and not yet sent back the corresponding tuples. We could ignore a worker that never started at all, due to EAGAIN or whatever else, but the original process that registered the worker has no way of finding this out. Now you might think we could just fix that by having the postmaster record something in the slot, but that doesn't work either, because the slot could get reused before the original process checks the status information. The fact that the slot has been reused is sufficient evidence that the worker was unregistered, which means it either stopped or we gave up on starting it, but it doesn't tell us which one.

To be able to tell that, we'd have to have a mechanism to prevent slots from getting reused until any necessary exit status information had been read, sort of like the OS-level zombie process mechanism (which we all love, I guess, and therefore definitely want to reinvent...?). The postmaster logic would need to be made more complicated, so that zombies couldn't accumulate: if a process asked for status notifications, but then died, any zombies waiting for it would need to be cleared. And you'd also have to make sure that a process which didn't die was guaranteed to read the status from the zombie to clear it, and that it did so in a reasonably timely fashion, which is currently in no way guaranteed and does not appear at all straightforward to guarantee.

And even if you solved for all of that, I think you might still find that it breaks some parallel query (or parallel create index) code that expects the number of workers to change at registration time, but not afterwards. So, that code would all need to be adjusted.
In short, I think Tom wants a pony. But that does not mean we should not fix this bug. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019-Nov-04, Robert Haas wrote: > On Mon, Nov 4, 2019 at 10:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > True, it's not a situation you especially want to be in. However, > > > I've lost count of the number of times that I've heard someone talk > > > about how their system was overstressed to the point that everything > > > else was failing, but Postgres kept chugging along. That's a good > > > reputation to have and we shouldn't just walk away from it. > > > > I agree with this point in principle. Everything else (queries, > > checkpointing) can fail, but it's critical that postmaster continues to > > run [...] > > Sure, I'm not arguing that the postmaster should blow up and die. I must have misinterpreted you, then. But then I also misinterpreted Tom, because I thought it was this stability problem that was "utter bunkum". > I was, however, arguing that if the postmaster fails to launch workers > for a parallel query due to process table exhaustion, it's OK for > *that query* to error out. That position makes sense to me. It would be nice [..ponies..] for the query to run regardless, but if it doesn't, it's not such a big deal; the query could have equally failed to run in a single process anyway. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > On 2019-Nov-04, Robert Haas wrote: >> On Mon, Nov 4, 2019 at 10:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >>> I agree with this point in principle. Everything else (queries, >>> checkpointing) can fail, but it's critical that postmaster continues to >>> run [...] >> Sure, I'm not arguing that the postmaster should blow up and die. > I must have misinterpreted you, then. But then I also misinterpreted > Tom, because I thought it was this stability problem that was "utter > bunkum". I fixed the postmaster crash problem in commit 3887e9455. The residual issue that I think is entirely bogus is that the parallel query start code will silently continue without workers if it hits our internal resource limit of how many bgworker ProcArray slots there are, but not do the same when it hits the external resource limit of the kernel refusing to fork(). I grant that there might be implementation reasons for that being difficult, but I reject Robert's apparent opinion that it's somehow desirable to behave that way. As things stand, we have all of the disadvantages that you can't predict how many workers you'll get, and none of the advantages of robustness in the face of system resource exhaustion. regards, tom lane
Hi,

On 2019-10-09 12:29:18 -0400, Robert Haas wrote:
> I would say rather that if fork() is failing on your system, you have a not very stable system.

I don't think that's really true, fwiw. It's often a good idea to turn on strict memory overcommit accounting, and with that set, it's actually fairly common to see fork() fail with ENOMEM, even if there's practically a reasonable amount of resources. Especially with larger shared buffers and without huge pages, the amount of memory needed for a postmaster child in the worst case is not insubstantial.

> The fact that parallel query is going to fail is sad, but not as sad as the fact that connecting to the database is also going to fail, and that logging into the system to try to fix the problem may well fail as well.

Well, but parallel query also has the potential to lead to a lot of new backends being started much more quickly than you'd get new connections on an analytics DB.

> Code that tries to make parallel query cope with this situation without an error wouldn't often be tested, so it might be buggy, and it wouldn't necessarily be a benefit if it did work. I expect many people would rather have the query fail and free up slots in the system process table than consume precisely all of them and then try to execute the query at a slower-than-expected rate.

I concede that you have a point here.

Greetings,

Andres Freund
Hi,

On 2019-11-04 12:14:53 -0500, Robert Haas wrote:
> If a process trying to register workers finds out that no worker slots are available, it discovers this at the time it tries to perform the registration. But fork() failure happens later and in a different process. The original process just finds out that the worker is "stopped," not whether or not it ever got started in the first place.

Is that really true? In the case where it started and failed we expect the error queue to have been attached to, and there to be either an error 'E' or a 'X' response (cf HandleParallelMessage()). It doesn't strike me as very complicated to keep track of whether any worker has sent an 'E' or not, no? I don't think we really need the

Funny (?) anecdote: I learned about this part of the system recently, after I had installed some crash handler inside postgres. Turns out that that diverted, as a side-effect, SIGUSR1 to its own signal handler. All tests in the main regression tests passed, except for ones getting stuck waiting for WaitForParallelWorkersToFinish(), which could be fixed by disabling parallelism aggressively. Took me like two hours to debug... Also, a bit sad that parallel query is the only visible failure (in the main tests) of breaking the sigusr1 infrastructure...

> We certainly can't ignore a worker that managed to start and then bombed out, because it might've already, for example, claimed a block from a Parallel Seq Scan and not yet sent back the corresponding tuples. We could ignore a worker that never started at all, due to EAGAIN or whatever else, but the original process that registered the worker has no way of finding this out.

Sure, but in that case we'd have gotten either an error back from the worker, or postmaster would have PANIC restarted everyone due to an unhandled error in the worker, no?

> And even if you solved for all of that, I think you might still find that it breaks some parallel query (or parallel create index) code that expects the number of workers to change at registration time, but not afterwards. So, that code would all need to be adjusted.

Fair enough. Although I think practically nearly everything has to be ready to handle workers just being slow to start up anyway, no? There are plenty of cases where we just finish before all workers get around to doing work.

Greetings,

Andres Freund
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2019-10-09 12:29:18 -0400, Robert Haas wrote: > > I would say rather that if fork() is failing on your system, you have > > a not very stable system. > > I don't think that's really true, fwiw. It's often a good idea to turn > on strict memory overcommit accounting, and with that set, it's actually > fairly common to see fork() fail with ENOMEM, even if there's > practically a reasonable amount of resources. Especially with larger > shared buffers and without huge pages, the amount of memory needed for a > postmaster child in the worst case is not insubstantial. I've not followed this thread very closely, but I agree with Andres here wrt fork() failing with ENOMEM in the field and not because the system isn't stable. Thanks, Stephen
On Mon, Nov 4, 2019 at 2:04 PM Andres Freund <andres@anarazel.de> wrote:
> Is that really true? In the case where it started and failed we expect the error queue to have been attached to, and there to be either an error 'E' or a 'X' response (cf HandleParallelMessage()). It doesn't strike me as very complicated to keep track of whether any worker has sent an 'E' or not, no? I don't think we really need the

One of us is confused here, because I don't think that helps. Consider three background workers Alice, Bob, and Charlie. Alice fails to launch because fork() fails. Bob launches but then exits unexpectedly. Charlie has no difficulties and carries out his assigned duties.

Now, the system you are proposing will say that Charlie is OK but Alice and Bob are a problem. However, that's the way it already works. What Tom wants is to distinguish Alice from Bob, and your proposal is of no help at all with that problem, so far as I can see.

> > We certainly can't ignore a worker that managed to start and then bombed out, because it might've already, for example, claimed a block from a Parallel Seq Scan and not yet sent back the corresponding tuples. We could ignore a worker that never started at all, due to EAGAIN or whatever else, but the original process that registered the worker has no way of finding this out.
>
> Sure, but in that case we'd have gotten either an error back from the worker, or postmaster would have PANIC restarted everyone due to an unhandled error in the worker, no?

An unhandled ERROR in the worker is not a PANIC. I think it's just an ERROR that ends up being fatal in effect, but even if it's actually promoted to FATAL, it's not a PANIC. It is *generally* true that if a worker hits an ERROR, the error will be propagated back to the leader, but it is not an invariable rule. One pretty common way that it fails to happen - common in the sense that it comes up during development, not common on production systems I hope - is if a worker dies before reaching the call to pq_redirect_to_shm_mq(). Before that, there's no possibility of communicating anything. Granted, at that point we shouldn't yet have done any work that might mess up the query results. Similarly, once we reach that point, we are dependent on a certain amount of good behavior for things to work as expected; yeah, any code that calls proc_exit() is supposed to signal an ERROR or FATAL first, but what if it doesn't? Granted, in that case we'd probably fail to send an 'X' message, too, so the leader would still have a chance to realize something is wrong.

I guess I agree with you to this extent: I made a policy decision that if a worker is successfully registered but then fails to show up, that's an ERROR. It would be possible to adopt the opposite policy, namely that if a worker doesn't show up, that's an "oh well." You'd have to be very certain that the worker wasn't going to show up later, though. For instance, suppose you check all of the shared memory queues used for returning tuples and find that every queue is either in a state where (1) nobody's ever attached to it or (2) somebody attached and then detached. This is not good enough, because it's possible that after you checked queue #1, and found it in the former state, someone attached and read a block, which caused queue #2 to enter the latter state before you got around to checking it. If you decide that it's OK to decide that we're done at this point, you'll never return the tuples that are pushed through queue #1.
But, assuming you nailed the door shut so that such problems could not occur, I think we could make a decision to ignore workers that failed before doing anything interesting. Whether that would be a good policy decision is pretty questionable in my mind. In addition to what I mentioned before, I think there's a serious danger that errors that users would have really wanted to know about - or developers would really want to have known about - would get ignored. You could have some horrible problem that's making your workers fail to launch, and the system would just carry on as if everything were fine, except with bad query plans. I realize that you and others might say "oh, well, monitor your logs, then," but I think there is certainly some value in an ordinary user being able to know that things didn't go well without having to look into the PostgreSQL log for errors. Now, maybe you think that's not enough value to justify having it work the way it does today, and I certainly respect that, but I don't view it that way myself.

What I mostly want to emphasize here is that, while parallel query has had a number of bugs in this area that were the result of shoddy design or inadequate testing - principally by me - this isn't one of them. This decision was made consciously by me because I thought it gave us the best chance of having a system that would be reliable and have satisfying behavior for users. Sounds like not everybody agrees, and that's fine, but I just want to get it out there that this wasn't accidental on my part.

> > And even if you solved for all of that, I think you might still find that it breaks some parallel query (or parallel create index) code that expects the number of workers to change at registration time, but not afterwards. So, that code would all need to be adjusted.
>
> Fair enough. Although I think practically nearly everything has to be ready to handle workers just being slow to start up anyway, no? There are plenty of cases where we just finish before all workers get around to doing work.

Because of the shutdown race mentioned above, we generally have to wait for workers to exit before we can shut down parallelism. See commit 2badb5afb89cd569500ef7c3b23c7a9d11718f2f (whose commit message also documents some of the behaviors now in question). So we tolerate slow startup in that it doesn't prevent us from getting started on query execution, but not to the extent that we can finish query execution without knowing definitively that every worker is either already gone or will never be showing up.

(Is it possible to do better there? Perhaps. If we could somehow throw up a brick wall to prevent new workers from doing anything that would cause problems, then verify that every worker which got past the brick wall has exited cleanly, then we could ignore the risk of more workers showing up later, because they'd hit the brick wall before causing any trouble.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-11-04 14:58:20 -0500, Robert Haas wrote:
> On Mon, Nov 4, 2019 at 2:04 PM Andres Freund <andres@anarazel.de> wrote:
> > Is that really true? In the case where it started and failed we expect the error queue to have been attached to, and there to be either an error 'E' or a 'X' response (cf HandleParallelMessage()). It doesn't strike me as very complicated to keep track of whether any worker has sent an 'E' or not, no? I don't think we really need the
>
> One of us is confused here, because I don't think that helps. Consider three background workers Alice, Bob, and Charlie. Alice fails to launch because fork() fails. Bob launches but then exits unexpectedly. Charlie has no difficulties and carries out his assigned duties.
>
> Now, the system you are proposing will say that Charlie is OK but Alice and Bob are a problem. However, that's the way it already works. What Tom wants is to distinguish Alice from Bob, and your proposal is of no help at all with that problem, so far as I can see.

I don't see how what I'm saying treats Alice and Bob the same. What I'm saying is that if a worker has been started and shut down, without signalling an error via the error queue, and without an exit code that causes postmaster to worry, then we can just ignore the worker for the purpose of determining whether the query succeeded, without a meaningful loss in reliability. And we can detect such cases easily; we already do. We just have to remove an ereport(), and document things.

> > > We certainly can't ignore a worker that managed to start and then bombed out, because it might've already, for example, claimed a block from a Parallel Seq Scan and not yet sent back the corresponding tuples. We could ignore a worker that never started at all, due to EAGAIN or whatever else, but the original process that registered the worker has no way of finding this out.
> >
> > Sure, but in that case we'd have gotten either an error back from the worker, or postmaster would have PANIC restarted everyone due to an unhandled error in the worker, no?
>
> An unhandled ERROR in the worker is not a PANIC. I think it's just an ERROR that ends up being fatal in effect, but even if it's actually promoted to FATAL, it's not a PANIC.

If it's an _exit without going through the PG machinery, it'll eventually be PANIC, albeit with a slight delay. And once we're actually executing the parallel query, we better have error reporting set up for parallel queries.

> It is *generally* true that if a worker hits an ERROR, the error will be propagated back to the leader, but it is not an invariable rule. One pretty common way that it fails to happen - common in the sense that it comes up during development, not common on production systems I hope - is if a worker dies before reaching the call to pq_redirect_to_shm_mq(). Before that, there's no possibility of communicating anything. Granted, at that point we shouldn't yet have done any work that might mess up the query results.

Right.

> Similarly, once we reach that point, we are dependent on a certain amount of good behavior for things to work as expected; yeah, any code that calls proc_exit() is supposed to signal an ERROR or FATAL first, but what if it doesn't? Granted, in that case we'd probably fail to send an 'X' message, too, so the leader would still have a chance to realize something is wrong.
I mean, in that case so many more things are screwed up, I don't buy that it's worth pessimizing ENOMEM handling for this. And if you're really concerned, we could add a before_shmem_exit hook or such that makes extra double sure that we've signalled something.

> I guess I agree with you to this extent: I made a policy decision that if a worker is successfully registered but then fails to show up, that's an ERROR. It would be possible to adopt the opposite policy, namely that if a worker doesn't show up, that's an "oh well." You'd have to be very certain that the worker wasn't going to show up later, though. For instance, suppose you check all of the shared memory queues used for returning tuples and find that every queue is either in a state where (1) nobody's ever attached to it or (2) somebody attached and then detached. This is not good enough, because it's possible that after you checked queue #1, and found it in the former state, someone attached and read a block, which caused queue #2 to enter the latter state before you got around to checking it. If you decide that it's OK to decide that we're done at this point, you'll never return the tuples that are pushed through queue #1.

That's why the code *already* waits for workers to attach, or for the slot to be marked unused/invalid/reused. I don't see how that applies to not explicitly erroring out when we know that the worker *failed* to start:

void
WaitForParallelWorkersToFinish(ParallelContext *pcxt)
...
			/*
			 * We didn't detect any living workers, but not all workers are
			 * known to have exited cleanly.  Either not all workers have
			 * launched yet, or maybe some of them failed to start or
			 * terminated abnormally.
			 */
			for (i = 0; i < pcxt->nworkers_launched; ++i)
			{
				pid_t		pid;
				shm_mq	   *mq;

				/*
				 * If the worker is BGWH_NOT_YET_STARTED or BGWH_STARTED, we
				 * should just keep waiting.  If it is BGWH_STOPPED, then
				 * further investigation is needed.
				 */
				if (pcxt->worker[i].error_mqh == NULL ||
					pcxt->worker[i].bgwhandle == NULL ||
					GetBackgroundWorkerPid(pcxt->worker[i].bgwhandle,
										   &pid) != BGWH_STOPPED)
					continue;

				/*
				 * Check whether the worker ended up stopped without ever
				 * attaching to the error queue.  If so, the postmaster was
				 * unable to fork the worker or it exited without initializing
				 * properly.  We must throw an error, since the caller may
				 * have been expecting the worker to do some work before
				 * exiting.
				 */
				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
				if (shm_mq_get_sender(mq) == NULL)
					ereport(ERROR,
							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
							 errmsg("parallel worker failed to initialize"),
							 errhint("More details may be available in the server log.")));

				/*
				 * The worker is stopped, but is attached to the error queue.
				 * Unless there's a bug somewhere, this will only happen when
				 * the worker writes messages and terminates after the
				 * CHECK_FOR_INTERRUPTS() near the top of this function and
				 * before the call to GetBackgroundWorkerPid().  In that case,
				 * our latch should have been set as well and the right things
				 * will happen on the next pass through the loop.
				 */
			}
		}

> But, assuming you nailed the door shut so that such problems could not occur, I think we could make a decision to ignore workers that failed before doing anything interesting. Whether that would be a good policy decision is pretty questionable in my mind. In addition to what I mentioned before, I think there's a serious danger that errors that users would have really wanted to know about - or developers would really want to have known about - would get ignored.
> You could have some horrible problem that's making your workers fail to launch, and the system would just carry on as if everything were fine, except with bad query plans. I realize that you and others might say "oh, well, monitor your logs, then," but I think there is certainly some value in an ordinary user being able to know that things didn't go well without having to look into the PostgreSQL log for errors. Now, maybe you think that's not enough value to justify having it work the way it does today, and I certainly respect that, but I don't view it that way myself.

Yea, this is somewhat of a pickle. I'm inclined to think that the cure of unnecessarily ERRORing out queries is worse than the disease, though: causing unnecessary failures just to make debugging of some not-all-that-likely errors easier is a rather severe measure.

I think I mentioned this to you on chat, but I think the in-core use of bgworkers, at least as they currently are designed, is/was an architecturally bad idea. There are numerous problems flowing from that, with error handling being one big recurring theme.

Greetings,

Andres Freund
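For concreteness, the "remove an ereport()" change Andres describes would land in the WaitForParallelWorkersToFinish() block quoted above. A very rough illustrative sketch, not a proposed patch: the log message wording here is invented, and a real change would also need to release the worker's bookkeeping and consider WaitForParallelWorkersToAttach() callers.

	mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
	if (shm_mq_get_sender(mq) == NULL)
	{
		/*
		 * Sketch: the worker stopped without ever attaching to its error
		 * queue (e.g. the postmaster could not fork it), so it cannot have
		 * done any work.  Forget it and continue with fewer workers,
		 * instead of throwing ERROR "parallel worker failed to initialize".
		 */
		ereport(LOG,
				(errmsg("parallel worker failed to start; continuing without it")));
		pcxt->worker[i].error_mqh = NULL;	/* stop waiting on this worker */
		continue;
	}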